BENCHMARK: bm_mm (matrix multiplication)

This benchmark tests basic operations with different program sizes.

Operation count used to calculate speed: 2 N**3 floating-point operations.
Precision: 8-byte (REAL*8).
On the SGI, f90 may be slightly faster than f77.

Programs:

p25:         multiplication loops; the innermost loop is the product-column
             loop.
p25v:        multiplication loops replaced by a call to DGEMM (-lveclib,
             -lblas).
p25_matmul:  multiplication loops replaced by a call to matmul.

TABLE I. CPU times and MFLOPS performance, calculated as 2 N**3 divided by
CPU time, for SGI Power Challenge L and Convex C3860.
ND: declared array dimension. N: actual array size. Length is calculated as
3 ND**2 * 8 bytes. CPU: CPU time in seconds. MFP: MFLOPS performance.
R: MFP divided by theoretical peak performance.
Convex: fc -O2 ...; SGI: f77 -O3 ...

===============================================================================
ND       N  SGI (R8000, 75 MHz, 300 MFLOPS)   Convex (120 MFLOPS)
(Length     (f77 -O3)                         (fc -O2)
(MB))       -------------------------------   -------------------------------
            p25              p25v             p25              p25v
            --------------   --------------   --------------   --------------
              CPU MFP    R     CPU MFP    R     CPU MFP    R     CPU MFP    R
-------------------------------------------------------------------------------
800    100   0.02             0.01             0.02             0.02
(15)   200    0.2                              0.16             0.16
       400    1.4             0.51 250 0.83     1.4              1.2 104 0.87
       800     22  47 0.16     4.1 250 0.83    10.7  96  0.8     9.8 104 0.87
1600   800                     4.2 244 0.81                     10.1 101 0.84
(61)  1600                    34.6 237 0.79                     81.4 101 0.84
4000  1600                    38.0 215 0.72 *                   82.9  99 0.82
(384) 4000                     607 210 0.70 *                   1276 100 0.84
-------------------------------------------------------------------------------
* f90 instead of f77.

TABLE I-1. As Table I, but for 90 MHz R8000.
===============================================================================
ND       N  SGI (R8000, 90 MHz, 360 MFLOPS)
(Length     (f77 -O3)
(MB))       -------------------------------
            p25              p25v
            --------------   --------------
              CPU MFP    R     CPU MFP    R
-------------------------------------------------------------------------------
800    800                     3.5 293 0.81
1600  1600                    29.5 278 0.77
-------------------------------------------------------------------------------

TABLE II. As in Table I, but using C = matmul(A, B) (p25_matmul.f).

===============================================================================
ND       N  SGI (R8000, 75 MHz, 300 MFLOPS)
(Length     (f90 -O3)
(MB))       -------------------------------
            p25_matmul
            --------------
              CPU MFP    R
-------------------------------------------------------------------------------
800    100      9   0
(15)   200      9   2
       400      9  14
       800      9 113 0.38
1600   800    150   7 0.02
(61)  1600    150  55 0.18
-------------------------------------------------------------------------------

SUMMARY

The Convex C-series processor (a vector processor) is almost as efficient
with an explicit multiplication loop as with the Veclib (BLAS) routine DGEMM
(96 versus 104 MFLOPS at N = 800).

The SGI (a pipelined processor) performance is drastically improved by the
DGEMM routine shipped with the machine (from 47 to 250 MFLOPS at N = 800).

Using matmul as it stands does not pay off: in Table II the CPU time does not
depend on N, and even with N = ND the performance is only 0.38 (ND = 800) and
0.18 (ND = 1600) of the theoretical peak.

Further -OPT: options should also be tested on the SGI Power Challenge.

The 90 MHz R8000 processor has a slightly lower efficiency than the 75 MHz
one; this could be due to the memory speed, which was the same in both tests.

APPENDIX

      PROGRAM P25
C
C     EACH STEP TIMED SEPARATELY.
C     V1, 92/11/26.
C
      IMPLICIT REAL*8 (A-H,O-Z)
      PARAMETER (ND = 800, NIN = 5, NOUT = 6)
      DIMENSION A(ND,ND), B(ND,ND), C(ND,ND)
C
      WRITE (NOUT,200)
      READ (NIN,100) N
      WRITE (NOUT,201) ND, N
C     T0 IS AN INPUT TO TEMPD AND MUST BE DEFINED BEFORE THE FIRST CALL.
      T0 = 0.D0
      CALL TEMPD(T0, T1, TD, NOUT, .TRUE.)
      DO 12 J = 1,N
         DO 10 I = 1,N
            A(I,J) = (J * I)
            B(I,J) = (J + I)
 10      CONTINUE
 12   CONTINUE
      CALL TEMPD(T1, T2, TD, NOUT, .TRUE.)
      DO 24 J = 1,N
         DO 18 I = 1,N
            C(I,J) = 0.D0
 18      CONTINUE
 24   CONTINUE
      CALL TEMPD(T2, T3, TD, NOUT, .TRUE.)
C     MATRIX PRODUCT; THE INNERMOST LOOP RUNS DOWN COLUMN J OF C.
      DO 26 J = 1,N
         DO 22 K = 1,N
            DO 20 I = 1,N
               C(I,J) = C(I,J) + A(I,K) * B(K,J)
 20         CONTINUE
 22      CONTINUE
 26   CONTINUE
      CALL TEMPD(T3, T4, TD, NOUT, .TRUE.)
      WRITE (NOUT,202) N, ND, ((C(I,J), J = 1,4), I = 1,4)
C
 100  FORMAT (I4)
 200  FORMAT (1H , 8HP25 V1  )
 201  FORMAT (1H , 8HND, N   , 2I8)
 202  FORMAT (1H , 2I8, /, (1H , 4E16.8))
      END

      PROGRAM P25V
      ...
 24   CONTINUE
      CALL TEMPD(T2, T3, TD, NOUT, .TRUE.)
      CALL DGEMM('N', 'N', N, N, N, 1.0D0,
     &           A, ND, B, ND, 0.D0, C, ND)
      CALL TEMPD(T3, T4, TD, NOUT, .TRUE.)
      ...
      END

      SUBROUTINE TEMPD(TOLD, TNEW, TDIF, NOUT, LPR)
      INTEGER NOUT
      REAL*8 TOLD, TNEW, TDIF
      LOGICAL LPR
C
C     SGI PCL TIMING ROUTINE, USING DTIME.
C     DTIME REPORTS ELAPSED EXECUTION TIME (USER, SYSTEM) SINCE THE
C     LAST CALL TO ITSELF; THEREFORE THIS ROUTINE WORKS OK ONLY IF THERE
C     ARE NO OTHER CALLS TO DTIME EXCEPT FROM THIS ROUTINE.
C     (DTIME MAY NOT MEASURE THE CPU TIME ITSELF.)
C
C     INPUT: TOLD, NOUT, LPR (ALL UNCHANGED).
C     OUTPUT: TNEW, TDIF.
C
      LOGICAL INIT, TEST
      REAL*4 DTIME, TARRAY
      DIMENSION TARRAY(2)
      REAL*8 TSUM, SDIF
      DATA INIT / .TRUE. /, TEST / .FALSE. /
C
      IF (INIT) THEN
         WRITE (NOUT,2000)
         TSUM = DTIME(TARRAY)
         TDIF = TARRAY(1)
         TNEW = TOLD + TDIF
         SDIF = TARRAY(2)
         INIT = .FALSE.
         IF (TEST) WRITE (NOUT,2200) TSUM, TNEW, TDIF, TOLD, SDIF
         RETURN
      ENDIF
      TSUM = DTIME(TARRAY)
      TDIF = TARRAY(1)
      TNEW = TOLD + TDIF
      SDIF = TARRAY(2)
      IF (TEST) WRITE (NOUT,2200) TSUM, TNEW, TDIF, TOLD, SDIF
      IF (LPR) WRITE (NOUT,2100) TNEW, TDIF, TOLD, SDIF
      RETURN
C
 2000 FORMAT (' -TEMPD- 3.0, 95/12/20. SGI PC TIMING ROUTINE.')
 2100 FORMAT (' TIME', F12.2,
     A        ' DIF', F12.2,
     B        ' REF', F12.2,
     C        ' DIF SYS', F12.2)
 2200 FORMAT (' SUM USER DIF OLD SYS', 5F8.2)
      END
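
The MFP, R and Length columns in the tables follow from the formulas given
with Table I. The functions below are a minimal sketch of that arithmetic and
are not part of the benchmark sources: the names MMMFP, MMRAT and MMLEN are
hypothetical, the CPU time is assumed to be in seconds, and one MB is assumed
to be 10**6 bytes (which matches the tabulated lengths).

```fortran
! Sketch of the arithmetic behind the MFP, R and Length columns.
! The function names are hypothetical, not part of the benchmark.

      FUNCTION MMMFP(N, CPU)
! MFLOPS rate of an N x N product: 2 N**3 operations in CPU seconds.
      REAL*8 MMMFP, CPU
      INTEGER N
      MMMFP = 2.0D0 * DBLE(N)**3 / (CPU * 1.0D6)
      END

      FUNCTION MMRAT(N, CPU, PEAK)
! R: measured MFLOPS divided by the theoretical peak (in MFLOPS).
      REAL*8 MMRAT, MMMFP, CPU, PEAK
      INTEGER N
      MMRAT = MMMFP(N, CPU) / PEAK
      END

      FUNCTION MMLEN(ND)
! Storage of three ND x ND REAL*8 arrays, in MB (assumed 10**6 bytes).
      REAL*8 MMLEN
      INTEGER ND
      MMLEN = 3.0D0 * DBLE(ND)**2 * 8.0D0 / 1.0D6
      END
```

For the SGI DGEMM run with ND = N = 800 (CPU 4.1, peak 300 MFLOPS),
MMMFP(800, 4.1D0) evaluates to about 249.8 and MMRAT(800, 4.1D0, 300.D0) to
about 0.83, in line with the tabulated 250 and 0.83; MMLEN(800) gives 15.36,
tabulated as 15.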