An inspection of the glibc sources revealed that the numerical functions provided by GNU glibc are mostly thin wrappers around the implementations in the mathematical co-processor; no rational approximation of glibc's own is used. This hardware support is sanctioned by the C99/IEEE 754 standards.
Following this track, I wrote a small test program to measure the execution time of various functions in the GNU system math library. In practice, it tests the implementation of these functions in the mathematical co-processor.
Fortran

program funs
  integer, parameter :: double = selected_real_kind(15)
  integer, parameter :: nsteps = 10000
  integer :: i, j
  real(double) :: x, y, s, xsteps = nsteps
  s = 0.0
  do j = 1, nsteps
    do i = 1, nsteps
      x = i/xsteps
      y = x*(1.0_double/3.0_double)
      s = s + y
    end do
  end do
  write(*,*) s
end program funs
C

#include <math.h>

int main()
{
  const int nsteps = 10000;
  int i, j;
  double xsteps = (double) nsteps;
  double x, y, s;

  s = 0.0;
  for (j = 1; j <= nsteps; j++)
    for (i = 1; i <= nsteps; i++) {
      x = (double) i / xsteps;
      /*!*/ y = pow(x, 1.0/3.0);
      s += y;
    }
  return s > 0.0 ? 1 : 0;
}
(Note that the otherwise useless summation and return value are there to prevent the compiler from optimizing the loop away.)
The function on the line marked /*!*/ is the one that was varied between runs.
Computations were done on an Intel(R) Core Quad and verified on Intel(R) Xeon machines (about 30% faster). The gcc and gfortran compilers give essentially the same results; gfortran may produce slightly faster code.
Results are summarized in the following table (ratio = -O4 time relative to a basic arithmetic operation).

                    Core Quad                fortran    i686
                 -O [ns]  -O4 [ns]  ratio    -O4 [ns]   -O4 [ns]
pow(x,1.0/3.0) 128.56 123.01 7.2 123.06 287.68
exp(1/3*log(x)) 120.21 116.14 6.8 117.14 269.20
cbrt(x) 90.15 85.79 5.0 - -
sinh(x) 95.15 91.80 5.4 91.99 235.34
log10(x) 95.74 91.14 5.4 91.73 95.35
atan2(x,2.0) 83.94 79.22 4.6 79.09 106.00
log(x) 79.30 74.98 4.4 74.88 95.22
cosh(x) 78.13 74.34 4.4 74.61 198.99
tan(x) 78.89 73.29 4.3 73.50 118.17
atan(x) 64.27 58.02 3.4 57.99 101.92
exp(x) 58.97 54.58 3.2 54.55 143.93
asin(x) 58.52 53.87 3.2 53.44 165.31
acos(x) 58.70 55.30 3.2 54.94 165.32
cos(x) 51.63 45.57 2.7 45.64 93.98
sin(x) 50.03 44.29 2.6 44.77 92.30
sqrt(x) 47.68 37.23 2.2 39.35 34.11
fabs(x) 20.93 16.14 1.0 16.11 17.25
(1.0/3.0)*x 23.02 17.16 1.0 17.22 17.87
(i686 = Intel(R) Pentium(R) 4 CPU 2.66GHz, 0.376 nsec)
(Core Quad = Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 0.4167 ns )
As expected, the -O4 optimization flag speeds up the computation only marginally, because the calls are actually handed off to the co-processor.
From the table it is directly visible that pow(x,a) is computed as exp(a*log(x)): the ratios add up, 3.2 (exp) + 4.4 (log) ≈ 7.2 (pow). Moreover, the results suggest that computing pow() by direct use of the exp(..log..) identity is even a little faster.
Similarly, log10 is perhaps computed as log(x)/log(10), i.e. log(x) times a precomputed constant, because 4.4 (log) + 1.0 (multiplication) = 5.4 (log10).
The computation times divide the functions into groups by duration:
- fast: arithmetic operations, fabs
- medium: sin, cos, tan, atan, cbrt, sqrt, asin, acos, sinh, cosh, log, log10, exp, atan2
- slow: pow
By the way, I changed the computation of x^(1/3) in my code from pow(x,1.0/3.0) to cbrt(x), which sped it up only by a few percent. Of course, this does not help with the slower computation of a general power.
Finally, the computation times may look horrible, but I think that a method which computes a function to a precision of 15 decimal digits only 7 times (!) slower than basic arithmetic operations is a miracle. A real-world miracle. :)
2 comments:
I tried it on an AMD Athlon II X3 415e (5k bogomips) and got similar results:
122.85ns with -O4
For the mockers of interpreted languages I did a quick comparison with Numerical Python
(in an ipython shell). It might be slowed somewhat by the need for 1e4 array allocations...
In [16]: %time a=[sum(pow(r_[0:1:10000j],1.0/3.0)) for i in range(10000)]
CPU times: user 13.85 s, sys: 0.12 s, total: 13.97 s
which means 138.5ns per op... quite nice
Rather surprising were outcomes of other tests
In [24]: %time a=[sum(sin(r_[0:1:10000j])) for i in range(10000)]
CPU times: user 4.04 s, sys: 0.04 s, total: 4.08 s
and especially
In [20]: %time a=[len(sqrt(r_[0:1:10000j])) for i in range(10000)]
CPU times: user 2.10 s, sys: 0.01 s, total: 2.10 s
which seems to be considerably faster than the compiled version!
Errata: the C-compiled sin() on my computer takes 37.88 ns and compiled sqrt() 18.12 ns. Maybe AMD takes a different approach to different functions? Or does only the compiler matter?