Thanks for your code Njuffa.
At [url]http://www.orangeowlsolutions.com/archives/256[/url] you could find our original CUDA code for the calculation of I0 according to Blair's algorithm. Of course, our code could be improved by using fma's, as you already do.
Did you further optimize the polynomial coefficients by minimax? Against which reference did you perform the minimax optimization and you made the assessment of the results of your routine?

At http://www.orangeowlsolutions.com/archives/256 you could find our original CUDA code for the calculation of I0 according to Blair's algorithm. Of course, our code could be improved by using fma's, as you already do.

Did you further optimize the polynomial coefficients by minimax? Against which reference did you perform the minimax optimization and you made the assessment of the results of your routine?

I kept Blair's original algorithmic framework, but generated new minimax approximations from scratch, using the Remez algorithm, operating with 1024-bit arithmetic for all intermediate computations. Relative error of both approximations was reported to be < 1e-17, the error peaks were leveled to less than 1e-33. For multi-precision arithmetic, I use Brent's MP code ([url]http://www.netlib.org/toms/524[/url]) which I have used for many years and consider reliable.
I used the series 10.25.2 in the DLMF ([url]http://dlmf.nist.gov/10.25#E2[/url]) as my reference, using the first 10*k+100 terms, where k is the argument to i0().
I kicked off a more intensive test with 100 million test cases last night to get a more accurate reading of the maximum ulp error for the posted code. All the worst case errors occur for arguments in [10,15]. I will report here the worst case ulp error found and the corresponding argument once the run is done.

I kept Blair's original algorithmic framework, but generated new minimax approximations from scratch, using the Remez algorithm, operating with 1024-bit arithmetic for all intermediate computations. Relative error of both approximations was reported to be < 1e-17, the error peaks were leveled to less than 1e-33. For multi-precision arithmetic, I use Brent's MP code (http://www.netlib.org/toms/524) which I have used for many years and consider reliable.

I used the series 10.25.2 in the DLMF (http://dlmf.nist.gov/10.25#E2) as my reference, using the first 10*k+100 terms, where k is the argument to i0().

I kicked off a more intensive test with 100 million test cases last night to get a more accurate reading of the maximum ulp error for the posted code. All the worst case errors occur for arguments in [10,15]. I will report here the worst case ulp error found and the corresponding argument once the run is done.

My test run finally finished using 100 million test case in [1e-3, 713]. Reference was implemented from the series mentioned above, using the first 1.25*k+35 terms [where k is the input to the i0() function], evaluated in 140-bit arithmetic. This provides slightly more than quadruple precision reference results. The worst case error found was 9.20648 ulps for input of 1.2558077675329185e+01. About 71% of results are rounded correctly.
The error of my rational approximation is higher than I would like. Comparing to SFSebastian's code based on Blair's paper, I notice he used the rational approximatiom of degree 16 + 3 from table VIII, listed as providing a precision of 18.38, more than an order of magnitude order better than my rational approximation of degree 12 + 5. Looks like I should try a rational approximation of higher degree.
SFSebastian, I noticed a few minor issue with the code on your website, that may or may not be worth fixing. Larger negative arguments are not handled correctly, there is a missing fabs(). The function overflows prematurely for arguments greater than about 709.7 in magnitude, this could be avoided by multiplying twice by exp(0.5). Specifying the coefficient as products of two double-precision numbers is likely to introduce additional rounding error as these products are not exactly representable as double-precision numbers; it would be better to specify the coefficients as a single literal. For the code branch dealing with arguments greater than 15, I would suggest multiplying by rsqrt() instead of dividing by sqrt(). While CUDA's double-precision rsqrt() is not IEEE-rounded its maximum ulp error is very close to 0.5, so the substitution should improve performance at negligible impact to accuracy.
While I don't have the time to run full test run with 100 million test cases against this code, based on preliminary testing it seems the the maximum ulp error could be around 8.5 ulps, with about 55% of results correctly rounded.

My test run finally finished using 100 million test case in [1e-3, 713]. Reference was implemented from the series mentioned above, using the first 1.25*k+35 terms [where k is the input to the i0() function], evaluated in 140-bit arithmetic. This provides slightly more than quadruple precision reference results. The worst case error found was 9.20648 ulps for input of 1.2558077675329185e+01. About 71% of results are rounded correctly.

The error of my rational approximation is higher than I would like. Comparing to SFSebastian's code based on Blair's paper, I notice he used the rational approximatiom of degree 16 + 3 from table VIII, listed as providing a precision of 18.38, more than an order of magnitude order better than my rational approximation of degree 12 + 5. Looks like I should try a rational approximation of higher degree.

SFSebastian, I noticed a few minor issue with the code on your website, that may or may not be worth fixing. Larger negative arguments are not handled correctly, there is a missing fabs(). The function overflows prematurely for arguments greater than about 709.7 in magnitude, this could be avoided by multiplying twice by exp(0.5). Specifying the coefficient as products of two double-precision numbers is likely to introduce additional rounding error as these products are not exactly representable as double-precision numbers; it would be better to specify the coefficients as a single literal. For the code branch dealing with arguments greater than 15, I would suggest multiplying by rsqrt() instead of dividing by sqrt(). While CUDA's double-precision rsqrt() is not IEEE-rounded its maximum ulp error is very close to 0.5, so the substitution should improve performance at negligible impact to accuracy.

While I don't have the time to run full test run with 100 million test cases against this code, based on preliminary testing it seems the the maximum ulp error could be around 8.5 ulps, with about 55% of results correctly rounded.

Thank you very much, Njuffa.
Below, please find attached the version of the code revised according to your recommendations:
[code]
__device__ double bessi0(double x)
{
double num, den, x2, y, tn_2, tn_1, tn, tn_1_old;
x2 = abs(x*x);
if (abs(x)

Thank you very much, Njuffa.
Below, please find attached the version of the code revised according to your recommendations:
[code]
__device__ double bessi0(double x)
{
double num, den, x2, y, tn_2, tn_1, tn, tn_1_old;
x2 = abs(x*x);
if (abs(x)

Thank you very much, Njuffa.
Below, please find attached the version of the code revised according to your recommendations:
[code]
__device__ double bessi0(double x)
{
double num, den, x2, y, tn_2, tn_1, tn, tn_1_old;
x2 = abs(x*x);
if (abs(x) <= 15.0)
{
num = -0.27288446572737951578789523409E+010;
num = fma (num, x2, -0.6768549084673824894340380223E+009);
num = fma (num, x2, -0.4130296432630476829274339869E+008);
num = fma (num, x2, -0.11016595146164611763171787004E+007);
num = fma (num, x2, -0.1624100026427837007503320319E+005);
num = fma (num, x2, -0.1503841142335444405893518061E+003);
num = fma (num, x2, -0.947449149975326604416967031E+000);
num = fma (num, x2, -0.4287350374762007105516581810E-002);
num = fma (num, x2, -0.1447896113298369009581404138E-004);
num = fma (num, x2, -0.375114023744978945259642850E-007);
num = fma (num, x2, -0.760147559624348256501094832E-010);
num = fma (num, x2, -0.121992831543841162565677055E-012);
num = fma (num, x2, -0.15587387207852991014838679E-015);
num = fma (num, x2, -0.15795544211478823152992269E-018);
num = fma (num, x2, -0.1247819710175804058844059E-021);
num = fma (num, x2, -0.72585406935875957424755E-025);
num = fma (num, x2, -0.28840544803647313855232E-028);
den = -0.2728844657273795156746641315E+010;
den = fma (den, x2, 0.5356255851066290475987259E+007);
den = fma (den, x2, -0.38305191682802536272760E+004);
den = fma (den, x2, 0.1E+001);
return num/den;
}
else
{
y=30./abs(x)-1.;
num=0.;
tn_2=1.;
num=cp[0]*tn_2;
tn_1=y;
num=num+cp[1]*tn_1;
for (int k=2; k<=25; k++)
{
tn=2.*y*tn_1-tn_2;
num=num+cp[k]*tn;
tn_1_old=tn_1;
tn_1=tn;
tn_2=tn_1_old;
}
return num*exp(0.5*x)*exp(0.5*x)*rsqrt(x);
}
}
[/code]
As you can see, I have also used fma's, in an attempt to make the code faster. I tried also to apply the same tool to the calculation of the Chebyshev polynomials (|x|>15), but, suprisingly, the code appears slower.
Do you have any recommendation on how speeding up the function, as my overall application seems to be limited by the calculation of I0?
Thank you very much in advance.

As you can see, I have also used fma's, in an attempt to make the code faster. I tried also to apply the same tool to the calculation of the Chebyshev polynomials (|x|>15), but, suprisingly, the code appears slower.

Do you have any recommendation on how speeding up the function, as my overall application seems to be limited by the calculation of I0?

Actually, it is not necessary to code FMAs by hand, the compiler usually does a very good job of finding opportunities to use FMA, and a quick look at the SASS (machine code) produced for the code you posted to your website shows that the maximum number of FMAs were generated. I just have a habit of hand-coding the FMAs because the code in the CUDA math library (a collection of header files) must be protected from a programmer's use of the compiler flag -fmad=false, which turns off the merging of FMUL and FADD into FMA. In various instances use of FMA is crucial to the functional correctness of that code.
I would recommend coding in a natural style and let the compiler work its magic. You can examine the generated code by running cuobjdump --dump-sass on the generated binary. While you may not want to observe the intricacies of the machine code, one easily observes a large number of DFMAs, and a very small number of DMUL and DADD instructions in the generated code.
In terms of performance, what is the distribution of input arguments in your app? If many arguments have a magnitude greater than 15, I would recommend replacing the Chebyshev evaluation with the straight polynomial I used. Blair suggests in the paper that this causes issues with accuracy, but I have found no evidence of that, in fact the accuracy in that regions seems to be excellent. If most of your inputs hit the rational approximation for the interval [-15,15], I currently don't see much room for speedup. I tried to make that code faster by choosing the lowest degree approximation possible within a double-precision target, however as previously noted this already seems to have increased the maximum ulp error slightly. You could replace the division with a reciprocal followed by back-multiply for a tiny performance boost, at some small loss in accuracy.
A bigger issue may be branch divergence caused by a mix of operands both smaller and larger in magnitude than 15. If your app can guarantee that the inputs are only in a certain range, one could design the switchover point such that all (or the vast majority) of arguments fall into a single code branch to minimize this effect. However, playing a little bit with this, there seems to be limited wiggle room here. Reasonable switchover points seems to be in the region between 12 and 16.
There may be better ways to evaluate the function in the interval [-15,15] but it would take weeks of investigation to find out what approach works best in terms of accuracy and performance. I went through such an exercise with the double-precision erfc() function in the CUDA math library, which (if I recall correctly) I rewrote three times over the years, not counting smaller modifications. I think it was Cleve Moler who noted that in terms of the implementation of special functions, much scientific code is still based on work done in the 1970s, or even the 1960s. There seems to be little active research in that area in general, and into the adaptation to modern architectures in particular (segmented approximations can perform quite poorly on GPUs due to divergence, for example).

Actually, it is not necessary to code FMAs by hand, the compiler usually does a very good job of finding opportunities to use FMA, and a quick look at the SASS (machine code) produced for the code you posted to your website shows that the maximum number of FMAs were generated. I just have a habit of hand-coding the FMAs because the code in the CUDA math library (a collection of header files) must be protected from a programmer's use of the compiler flag -fmad=false, which turns off the merging of FMUL and FADD into FMA. In various instances use of FMA is crucial to the functional correctness of that code.

I would recommend coding in a natural style and let the compiler work its magic. You can examine the generated code by running cuobjdump --dump-sass on the generated binary. While you may not want to observe the intricacies of the machine code, one easily observes a large number of DFMAs, and a very small number of DMUL and DADD instructions in the generated code.

In terms of performance, what is the distribution of input arguments in your app? If many arguments have a magnitude greater than 15, I would recommend replacing the Chebyshev evaluation with the straight polynomial I used. Blair suggests in the paper that this causes issues with accuracy, but I have found no evidence of that, in fact the accuracy in that regions seems to be excellent. If most of your inputs hit the rational approximation for the interval [-15,15], I currently don't see much room for speedup. I tried to make that code faster by choosing the lowest degree approximation possible within a double-precision target, however as previously noted this already seems to have increased the maximum ulp error slightly. You could replace the division with a reciprocal followed by back-multiply for a tiny performance boost, at some small loss in accuracy.

A bigger issue may be branch divergence caused by a mix of operands both smaller and larger in magnitude than 15. If your app can guarantee that the inputs are only in a certain range, one could design the switchover point such that all (or the vast majority) of arguments fall into a single code branch to minimize this effect. However, playing a little bit with this, there seems to be limited wiggle room here. Reasonable switchover points seems to be in the region between 12 and 16.

There may be better ways to evaluate the function in the interval [-15,15] but it would take weeks of investigation to find out what approach works best in terms of accuracy and performance. I went through such an exercise with the double-precision erfc() function in the CUDA math library, which (if I recall correctly) I rewrote three times over the years, not counting smaller modifications. I think it was Cleve Moler who noted that in terms of the implementation of special functions, much scientific code is still based on work done in the 1970s, or even the 1960s. There seems to be little active research in that area in general, and into the adaptation to modern architectures in particular (segmented approximations can perform quite poorly on GPUs due to divergence, for example).

For more performance in the interval [-15,15], one could use the following polynomial approximation. It provides excellent average error, but the maximum ulp error seems to be a bit higher than with a rational approximation (from preliminary testing it looks like it could be 10 ulps). This is not very surprising. The elimination of the division reduces rounding error, but the basic numerical properties of the polynomial approximation over this wide interval are even dicier than those of the rational approximation (all coefficients but one are positive, so there is accumulated error from the truncation of the coefficients, and the terms are not added in strictly increasing order of magnitude for many inputs).
[code]
a = a * a;
p = 8.1431960134340991E-046;
p = fma (p, a, -2.4111163556938050E-043);
p = fma (p, a, 8.0484586430788548E-040);
p = fma (p, a, 4.0212015993738263E-037);
p = fma (p, a, 5.7880053859816979E-034);
p = fma (p, a, 4.8362662530453255E-031);
p = fma (p, a, 3.8521770404434839E-028);
p = fma (p, a, 2.5968203434295812E-025);
p = fma (p, a, 1.4964121357322659E-022);
p = fma (p, a, 7.2422128733687263E-020);
p = fma (p, a, 2.8969053470393839E-017);
p = fma (p, a, 9.3859663916487907E-015);
p = fma (p, a, 2.4028075622089642E-012);
p = fma (p, a, 4.7095027952895302E-010);
p = fma (p, a, 6.7816840279328497E-008);
p = fma (p, a, 6.7816840277700736E-006);
p = fma (p, a, 4.3402777777779694E-004);
p = fma (p, a, 1.5624999999999981E-002);
p = fma (p, a, 2.5000000000000000E-001);
p = fma (p, a, 1.0000000000000000E+000);
[/code]

For more performance in the interval [-15,15], one could use the following polynomial approximation. It provides excellent average error, but the maximum ulp error seems to be a bit higher than with a rational approximation (from preliminary testing it looks like it could be 10 ulps). This is not very surprising. The elimination of the division reduces rounding error, but the basic numerical properties of the polynomial approximation over this wide interval are even dicier than those of the rational approximation (all coefficients but one are positive, so there is accumulated error from the truncation of the coefficients, and the terms are not added in strictly increasing order of magnitude for many inputs).

a = a * a;
p = 8.1431960134340991E-046;
p = fma (p, a, -2.4111163556938050E-043);
p = fma (p, a, 8.0484586430788548E-040);
p = fma (p, a, 4.0212015993738263E-037);
p = fma (p, a, 5.7880053859816979E-034);
p = fma (p, a, 4.8362662530453255E-031);
p = fma (p, a, 3.8521770404434839E-028);
p = fma (p, a, 2.5968203434295812E-025);
p = fma (p, a, 1.4964121357322659E-022);
p = fma (p, a, 7.2422128733687263E-020);
p = fma (p, a, 2.8969053470393839E-017);
p = fma (p, a, 9.3859663916487907E-015);
p = fma (p, a, 2.4028075622089642E-012);
p = fma (p, a, 4.7095027952895302E-010);
p = fma (p, a, 6.7816840279328497E-008);
p = fma (p, a, 6.7816840277700736E-006);
p = fma (p, a, 4.3402777777779694E-004);
p = fma (p, a, 1.5624999999999981E-002);
p = fma (p, a, 2.5000000000000000E-001);
p = fma (p, a, 1.0000000000000000E+000);

I plugged the above polynomial approximation for the interval [-15,15] into the code I posted previously and re-ran my accuracy check. With 100 million test inputs over the interval [1e-3,714], 72% of results were correctly rounded. Maximum ulp error observed was 9.43816 for an input of 11.843115728810956. On a K20c the observed throughput was 5+ billion function evaluations per second.

I plugged the above polynomial approximation for the interval [-15,15] into the code I posted previously and re-ran my accuracy check. With 100 million test inputs over the interval [1e-3,714], 72% of results were correctly rounded. Maximum ulp error observed was 9.43816 for an input of 11.843115728810956. On a K20c the observed throughput was 5+ billion function evaluations per second.

As always, thank you very much for your deep, detailed and competent comments, for the codes shred and for the time spent in assessing the accuracy performance.
Driven by your input, I discovered that, for my application, the input for I0 is always greater than 26.
So, and based on your suggestions, I changed the |x|>15 branch of my code with the one you originally posted and I experienced a speedup of about 1.5 (for a certain test case). If I understand correctly, the |x|>15 branch you initially posted is a straight rational polynomial expression for the rational Chebyshev polynomial expression of my original code, right?
Again, thank you very much.

As always, thank you very much for your deep, detailed and competent comments, for the codes shred and for the time spent in assessing the accuracy performance.

Driven by your input, I discovered that, for my application, the input for I0 is always greater than 26.

So, and based on your suggestions, I changed the |x|>15 branch of my code with the one you originally posted and I experienced a speedup of about 1.5 (for a certain test case). If I understand correctly, the |x|>15 branch you initially posted is a straight rational polynomial expression for the rational Chebyshev polynomial expression of my original code, right?

Here is the final code I'm using now:
[code]
__device__ double bessi0(double x)
{
// -- See paper
// J.M. Blair, "Rational Chebyshev approximations for the modified Bessel functions I_0(x) and I_1(x)", Math. Comput., vol. 28, n. 126, pp. 581-583, Apr. 1974.
double num, den, x2;
x2 = abs(x*x);
x=abs(x);
if (x > 15.0)
{
den = 1.0 / x;
num = -4.4979236558557991E+006;
num = fma (num, den, 2.7472555659426521E+006);
num = fma (num, den, -6.4572046640793153E+005);
num = fma (num, den, 8.5476214845610564E+004);
num = fma (num, den, -7.1127665397362362E+003);
num = fma (num, den, 4.1710918140001479E+002);
num = fma (num, den, -1.3787683843558749E+001);
num = fma (num, den, 1.1452802345029696E+000);
num = fma (num, den, 2.1935487807470277E-001);
num = fma (num, den, 9.0727240339987830E-002);
num = fma (num, den, 4.4741066428061006E-002);
num = fma (num, den, 2.9219412078729436E-002);
num = fma (num, den, 2.8050629067165909E-002);
num = fma (num, den, 4.9867785050221047E-002);
num = fma (num, den, 3.9894228040143265E-001);
num = num * den;
den = sqrt (x);
num = num * den;
den = exp (0.5 * x); /* prevent premature overflow */
num = num * den;
num = num * den;
return num;
}
else
{
num = -0.28840544803647313855232E-028;
num = fma (num, x2, -0.72585406935875957424755E-025);
num = fma (num, x2, -0.1247819710175804058844059E-021);
num = fma (num, x2, -0.15795544211478823152992269E-018);
num = fma (num, x2, -0.15587387207852991014838679E-015);
num = fma (num, x2, -0.121992831543841162565677055E-012);
num = fma (num, x2, -0.760147559624348256501094832E-010);
num = fma (num, x2, -0.375114023744978945259642850E-007);
num = fma (num, x2, -0.1447896113298369009581404138E-004);
num = fma (num, x2, -0.4287350374762007105516581810E-002);
num = fma (num, x2, -0.947449149975326604416967031E+000);
num = fma (num, x2, -0.1503841142335444405893518061E+003);
num = fma (num, x2, -0.1624100026427837007503320319E+005);
num = fma (num, x2, -0.11016595146164611763171787004E+007);
num = fma (num, x2, -0.4130296432630476829274339869E+008);
num = fma (num, x2, -0.6768549084673824894340380223E+009);
num = fma (num, x2, -0.27288446572737951578789523409E+010);
den = 0.1E+001;
den = fma (den, x2, -0.38305191682802536272760E+004);
den = fma (den, x2, 0.5356255851066290475987259E+007);
den = fma (den, x2, -0.2728844657273795156746641315E+010);
return num/den;
}
}
[/code]
The first part is your originally posted code, the second part is my original code.

__device__ double bessi0(double x)
{
// -- See paper
// J.M. Blair, "Rational Chebyshev approximations for the modified Bessel functions I_0(x) and I_1(x)", Math. Comput., vol. 28, n. 126, pp. 581-583, Apr. 1974.

double num, den, x2;

x2 = abs(x*x);

x=abs(x);

if (x > 15.0)
{
den = 1.0 / x;
num = -4.4979236558557991E+006;
num = fma (num, den, 2.7472555659426521E+006);
num = fma (num, den, -6.4572046640793153E+005);
num = fma (num, den, 8.5476214845610564E+004);
num = fma (num, den, -7.1127665397362362E+003);
num = fma (num, den, 4.1710918140001479E+002);
num = fma (num, den, -1.3787683843558749E+001);
num = fma (num, den, 1.1452802345029696E+000);
num = fma (num, den, 2.1935487807470277E-001);
num = fma (num, den, 9.0727240339987830E-002);
num = fma (num, den, 4.4741066428061006E-002);
num = fma (num, den, 2.9219412078729436E-002);
num = fma (num, den, 2.8050629067165909E-002);
num = fma (num, den, 4.9867785050221047E-002);
num = fma (num, den, 3.9894228040143265E-001);
num = num * den;
den = sqrt (x);
num = num * den;
den = exp (0.5 * x); /* prevent premature overflow */
num = num * den;
num = num * den;
return num;
}
else
{
num = -0.28840544803647313855232E-028;
num = fma (num, x2, -0.72585406935875957424755E-025);
num = fma (num, x2, -0.1247819710175804058844059E-021);
num = fma (num, x2, -0.15795544211478823152992269E-018);
num = fma (num, x2, -0.15587387207852991014838679E-015);
num = fma (num, x2, -0.121992831543841162565677055E-012);
num = fma (num, x2, -0.760147559624348256501094832E-010);
num = fma (num, x2, -0.375114023744978945259642850E-007);
num = fma (num, x2, -0.1447896113298369009581404138E-004);
num = fma (num, x2, -0.4287350374762007105516581810E-002);
num = fma (num, x2, -0.947449149975326604416967031E+000);
num = fma (num, x2, -0.1503841142335444405893518061E+003);
num = fma (num, x2, -0.1624100026427837007503320319E+005);
num = fma (num, x2, -0.11016595146164611763171787004E+007);
num = fma (num, x2, -0.4130296432630476829274339869E+008);
num = fma (num, x2, -0.6768549084673824894340380223E+009);
num = fma (num, x2, -0.27288446572737951578789523409E+010);

den = 0.1E+001;
den = fma (den, x2, -0.38305191682802536272760E+004);
den = fma (den, x2, 0.5356255851066290475987259E+007);
den = fma (den, x2, -0.2728844657273795156746641315E+010);

return num/den;
}
}

The first part is your originally posted code, the second part is my original code.

After discovering that Blair stated approximations for |x| > 15 in terms of unexpanded Chebyshev polynomials (which, as you can see, require two FMAs per coefficient to evaluate, versus one FMA per coefficient for ordinary polynomials), I decided to go straight to my own custom rational approximation for improved effiency. I did not attempt to convert Blair's Chebyshev approximations into the equivalent plain polynomials, which obviously is not hard to do.
After looking at the accuracy of rational approximations with polynomials of decreasing degree in the denominator, it seemed to me that a simple polynomial approximations should work just as well. On the first try I got something that appeared to have an error of about 5 ulps, including the error contributed by sqrt() and exp(), and this looked good enough to me, so I posted it. With a couple of weeks of work, one could certainly fine tune the approximations for both intervals to further improve the accuracy. It may also be advantageous to move the switchover point down from 15 to maybe 12 or so to get aproximations with similar accuracy across both intervals; right now there is a marked imbalance. The development process for this function is a bit slow since generating and testing each new approximation takes many hours, given that my high-accuracy reference function is slow. It really puts the brakes on my process of experimental exploration.

After discovering that Blair stated approximations for |x| > 15 in terms of unexpanded Chebyshev polynomials (which, as you can see, require two FMAs per coefficient to evaluate, versus one FMA per coefficient for ordinary polynomials), I decided to go straight to my own custom rational approximation for improved effiency. I did not attempt to convert Blair's Chebyshev approximations into the equivalent plain polynomials, which obviously is not hard to do.

After looking at the accuracy of rational approximations with polynomials of decreasing degree in the denominator, it seemed to me that a simple polynomial approximations should work just as well. On the first try I got something that appeared to have an error of about 5 ulps, including the error contributed by sqrt() and exp(), and this looked good enough to me, so I posted it. With a couple of weeks of work, one could certainly fine tune the approximations for both intervals to further improve the accuracy. It may also be advantageous to move the switchover point down from 15 to maybe 12 or so to get aproximations with similar accuracy across both intervals; right now there is a marked imbalance. The development process for this function is a bit slow since generating and testing each new approximation takes many hours, given that my high-accuracy reference function is slow. It really puts the brakes on my process of experimental exploration.

For CUDA 6.0 functions cyl_bessel_i0{f} and cyl_bessel_i1{f} were added to the CUDA math library. These compute the modified Bessel functions of the first kind of order 0 and 1, i.e. I0(x) and I1(x).

For CUDA 6.0 functions cyl_bessel_i0{f} and cyl_bessel_i1{f} were added to the CUDA math library. These compute the modified Bessel functions of the first kind of order 0 and 1, i.e. I0(x) and I1(x).

That's a whole truckload of awesome right there. I use the Kaiser Bessel window for imaging applications, and I'm curious if it's faster to use a pre-calculated 2D window from global memory or if I should just calculate it on the fly as I'm de-meadning the image to avoid the reads associated with bringing it out of global memory. I don't need double precision here; is there a single precision version in 6.0 (I'm still down at 5.0/5.5)?

That's a whole truckload of awesome right there. I use the Kaiser Bessel window for imaging applications, and I'm curious if it's faster to use a pre-calculated 2D window from global memory or if I should just calculate it on the fly as I'm de-meadning the image to avoid the reads associated with bringing it out of global memory. I don't need double precision here; is there a single precision version in 6.0 (I'm still down at 5.0/5.5)?

Both single and double precision versions of the modified Bessel functions of the first kind (orders 0 and 1) are supplied in CUDA 6.0. The single precision functions are cyl_bessel_i0f() and cyl_bessel_i1f().
As for the tradeoffs with pre-calculated data, it would probably be simplest to try it both ways. As the implementations of these functions use simple polynomial approximations they are reasonably fast.

Both single and double precision versions of the modified Bessel functions of the first kind (orders 0 and 1) are supplied in CUDA 6.0. The single precision functions are cyl_bessel_i0f() and cyl_bessel_i1f().

As for the tradeoffs with pre-calculated data, it would probably be simplest to try it both ways. As the implementations of these functions use simple polynomial approximations they are reasonably fast.

At http://www.orangeowlsolutions.com/archives/256 you could find our original CUDA code for the calculation of I0 according to Blair's algorithm. Of course, our code could be improved by using fma's, as you already do.

Did you further optimize the polynomial coefficients by minimax? Against which reference did you perform the minimax optimization and you made the assessment of the results of your routine?

wwww.orangeowlsolutions.com

I used the series 10.25.2 in the DLMF (http://dlmf.nist.gov/10.25#E2) as my reference, using the first 10*k+100 terms, where k is the argument to i0().

I kicked off a more intensive test with 100 million test cases last night to get a more accurate reading of the maximum ulp error for the posted code. All the worst case errors occur for arguments in [10,15]. I will report here the worst case ulp error found and the corresponding argument once the run is done.

The error of my rational approximation is higher than I would like. Comparing to SFSebastian's code based on Blair's paper, I notice he used the rational approximatiom of degree 16 + 3 from table VIII, listed as providing a precision of 18.38, more than an order of magnitude order better than my rational approximation of degree 12 + 5. Looks like I should try a rational approximation of higher degree.

SFSebastian, I noticed a few minor issue with the code on your website, that may or may not be worth fixing. Larger negative arguments are not handled correctly, there is a missing fabs(). The function overflows prematurely for arguments greater than about 709.7 in magnitude, this could be avoided by multiplying twice by exp(0.5). Specifying the coefficient as products of two double-precision numbers is likely to introduce additional rounding error as these products are not exactly representable as double-precision numbers; it would be better to specify the coefficients as a single literal. For the code branch dealing with arguments greater than 15, I would suggest multiplying by rsqrt() instead of dividing by sqrt(). While CUDA's double-precision rsqrt() is not IEEE-rounded its maximum ulp error is very close to 0.5, so the substitution should improve performance at negligible impact to accuracy.

While I don't have the time to run full test run with 100 million test cases against this code, based on preliminary testing it seems the the maximum ulp error could be around 8.5 ulps, with about 55% of results correctly rounded.

Below, please find attached the version of the code revised according to your recommendations:

[code]__device__ double bessi0(double x)

{

double num, den, x2, y, tn_2, tn_1, tn, tn_1_old;

x2 = abs(x*x);

if (abs(x)

wwww.orangeowlsolutions.com

Below, please find attached the version of the code revised according to your recommendations:

[code]__device__ double bessi0(double x)

{

double num, den, x2, y, tn_2, tn_1, tn, tn_1_old;

x2 = abs(x*x);

if (abs(x)

wwww.orangeowlsolutions.com

Below, please find attached the version of the code revised according to your recommendations:

As you can see, I have also used fma's, in an attempt to make the code faster. I tried also to apply the same tool to the calculation of the Chebyshev polynomials (|x|>15), but, suprisingly, the code appears slower.

Do you have any recommendation on how speeding up the function, as my overall application seems to be limited by the calculation of I0?

Thank you very much in advance.

wwww.orangeowlsolutions.com

I would recommend coding in a natural style and let the compiler work its magic. You can examine the generated code by running cuobjdump --dump-sass on the generated binary. While you may not want to observe the intricacies of the machine code, one easily observes a large number of DFMAs, and a very small number of DMUL and DADD instructions in the generated code.

In terms of performance, what is the distribution of input arguments in your app? If many arguments have a magnitude greater than 15, I would recommend replacing the Chebyshev evaluation with the straight polynomial I used. Blair suggests in the paper that this causes issues with accuracy, but I have found no evidence of that, in fact the accuracy in that regions seems to be excellent. If most of your inputs hit the rational approximation for the interval [-15,15], I currently don't see much room for speedup. I tried to make that code faster by choosing the lowest degree approximation possible within a double-precision target, however as previously noted this already seems to have increased the maximum ulp error slightly. You could replace the division with a reciprocal followed by back-multiply for a tiny performance boost, at some small loss in accuracy.

A bigger issue may be branch divergence caused by a mix of operands both smaller and larger in magnitude than 15. If your app can guarantee that the inputs are only in a certain range, one could design the switchover point such that all (or the vast majority) of arguments fall into a single code branch to minimize this effect. However, playing a little bit with this, there seems to be limited wiggle room here. Reasonable switchover points seems to be in the region between 12 and 16.

There may be better ways to evaluate the function in the interval [-15,15] but it would take weeks of investigation to find out what approach works best in terms of accuracy and performance. I went through such an exercise with the double-precision erfc() function in the CUDA math library, which (if I recall correctly) I rewrote three times over the years, not counting smaller modifications. I think it was Cleve Moler who noted that in terms of the implementation of special functions, much scientific code is still based on work done in the 1970s, or even the 1960s. There seems to be little active research in that area in general, and into the adaptation to modern architectures in particular (segmented approximations can perform quite poorly on GPUs due to divergence, for example).

Driven by your input, I discovered that, for my application, the input for I0 is always greater than 26.

So, and based on your suggestions, I changed the |x|>15 branch of my code with the one you originally posted and I experienced a speedup of about 1.5 (for a certain test case). If I understand correctly, the |x|>15 branch you initially posted is a straight rational polynomial expression for the rational Chebyshev polynomial expression of my original code, right?

Again, thank you very much.

wwww.orangeowlsolutions.com

The first part is your originally posted code, the second part is my original code.

wwww.orangeowlsolutions.com

After looking at the accuracy of rational approximations with polynomials of decreasing degree in the denominator, it seemed to me that a simple polynomial approximations should work just as well. On the first try I got something that appeared to have an error of about 5 ulps, including the error contributed by sqrt() and exp(), and this looked good enough to me, so I posted it. With a couple of weeks of work, one could certainly fine tune the approximations for both intervals to further improve the accuracy. It may also be advantageous to move the switchover point down from 15 to maybe 12 or so to get aproximations with similar accuracy across both intervals; right now there is a marked imbalance. The development process for this function is a bit slow since generating and testing each new approximation takes many hours, given that my high-accuracy reference function is slow. It really puts the brakes on my process of experimental exploration.

As for the tradeoffs with pre-calculated data, it would probably be simplest to try it both ways. As the implementations of these functions use simple polynomial approximations they are reasonably fast.