An inspection of the glibc sources revealed that the numerical functions provided by GNU glibc are mostly thin wrappers around the implementations in the mathematical co-processor; no rational approximation of glibc's own is used. This hardware support is sanctioned by the C99/IEEE 754 standards.
Following this track, I wrote a small test program to measure the execution time of various functions in the GNU system math library. In practice, it tests the implementation of these functions in the mathematical co-processor.
Fortran

program funs
  integer, parameter :: double = selected_real_kind(15)
  integer, parameter :: nsteps = 10000
  integer :: i, j
  real(double) :: x, y, s, xsteps = nsteps
  s = 0.0
  do j = 1, nsteps
    do i = 1, nsteps
      x = i/xsteps
      y = x*(1.0_double/3.0_double)
      s = s + y
    end do
  end do
  write(*,*) s
end program funs
C

#include <math.h>

int main()
{
  const int nsteps = 10000;
  int i, j;
  double xsteps = (double) nsteps;
  double x, y, s;

  s = 0.0;
  for (j = 1; j <= nsteps; j++)
    for (i = 1; i <= nsteps; i++) {
      x = (double) i / xsteps;
      /*!*/ y = pow(x, 1.0/3.0);
      s += y;
    }
  return s > 0.0 ? 1 : 0;
}
(Note that the otherwise useless summation and return value are there to prevent the compiler from optimizing the loop away.)
The function on the line marked /*!*/ is the one that was varied between runs.
Computations were done on an Intel(R) Core Quad and verified on Intel(R) Xeon machines (about 30% faster). The gcc and gfortran compilers give essentially the same results; gfortran may produce slightly faster code.
Results are summarized in the following table (ratio = -O4 time relative to a basic arithmetic operation).

                    Core Quad                fortran    i686
                 -O [ns]  -O4 [ns]  ratio    -O4 [ns]   -O4 [ns]
pow(x,1.0/3.0) 128.56 123.01 7.2 123.06 287.68
exp(1/3*log(x)) 120.21 116.14 6.8 117.14 269.20
cbrt(x) 90.15 85.79 5.0 - -
sinh(x) 95.15 91.80 5.4 91.99 235.34
log10(x) 95.74 91.14 5.4 91.73 95.35
atan2(x,2.0) 83.94 79.22 4.6 79.09 106.00
log(x) 79.30 74.98 4.4 74.88 95.22
cosh(x) 78.13 74.34 4.4 74.61 198.99
tan(x) 78.89 73.29 4.3 73.50 118.17
atan(x) 64.27 58.02 3.4 57.99 101.92
exp(x) 58.97 54.58 3.2 54.55 143.93
asin(x) 58.52 53.87 3.2 53.44 165.31
acos(x) 58.70 55.30 3.2 54.94 165.32
cos(x) 51.63 45.57 2.7 45.64 93.98
sin(x) 50.03 44.29 2.6 44.77 92.30
sqrt(x) 47.68 37.23 2.2 39.35 34.11
fabs(x) 20.93 16.14 1.0 16.11 17.25
(1.0/3.0)*x 23.02 17.16 1.0 17.22 17.87
(i686 = Intel(R) Pentium(R) 4 CPU 2.66GHz, 0.376 nsec)
(Core Quad = Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz, 0.4167 ns )
As expected, the -O4 optimization flag speeds up the computation only marginally, because the calls are actually handed off to the co-processor.
From the table it is directly visible that pow(x,a) is computed as exp(a*log(x)): the ratios add up, 3.2 (exp) + 4.4 (log) ≈ 7.2 (pow). Moreover, the results suggest that computing pow() by direct use of the exp(..log..) identity is even a little faster.
Similarly, log10 is perhaps computed as log(x)/log(10), i.e. log(x) times a precomputed constant, because 4.4 (log) + 1.0 (multiplication) = 5.4 (log10).
The computation times divide the functions into groups by duration:
- fast: arithmetic operations, fabs
- medium: sin, cos, tan, atan, cbrt, sqrt, asin, acos, sinh, cosh, log, log10, exp, atan2
- slow: pow
By the way, I changed the computation of x^(1/3) in my code from pow(x,1.0/3.0) to cbrt(x), which sped it up only by a few percent. Of course, this does not help with the slower computation of a general power.
Finally, the computation times may look horrible, but I think that a method which computes a function to a precision of 15 decimal digits only 7 times (!) slower than basic arithmetic operations is a miracle. A real-world miracle. :)
2 comments:
I tried it on an AMD Athlon II X3 415e (5k bogomips) and got similar results:
122.85ns with -O4
For the mockers of interpreted languages I did a quick comparison with Numerical Python
(in an ipython shell). It might be slowed somewhat by the need for 1e4 array allocations...
In [16]: %time a=[sum(pow(r_[0:1:10000j],1.0/3.0)) for i in range(10000)]
CPU times: user 13.85 s, sys: 0.12 s, total: 13.97 s
which means 138.5ns per op... quite nice
Rather surprising were outcomes of other tests
In [24]: %time a=[sum(sin(r_[0:1:10000j])) for i in range(10000)]
CPU times: user 4.04 s, sys: 0.04 s, total: 4.08 s
and especially
In [20]: %time a=[len(sqrt(r_[0:1:10000j])) for i in range(10000)]
CPU times: user 2.10 s, sys: 0.01 s, total: 2.10 s
which seems to be considerably faster than the compiled version!
Errata: the C-compiled sin() on my computer takes 37.88 ns and compiled sqrt() 18.12 ns. Maybe AMD takes a different approach to different functions? Or does only the compiler matter?