Matrix multiplication on workstations

Preliminary test 96/06/27
Updated 97/10/30
(see also: bench_bm_mm.txt)


TABLE 1. Matrix multiplication using program p25.f, compiled as
f77 -On ..., i.e., using defaults. p25.f uses simple DO loops, with
the row-index loop (the vectorizable loop) innermost, for locality.
The MFLOPS entry is defined as 2 N**3 / (CPU time).

========================================================================
Machine                  Opt. level   ND     N      CPU (sec)   MFLOPS
------------------------------------------------------------------------
HP 710 32 MB             1            400    200    8           2
(ur)                                         400    63          2
                         2, 3         400    200    2           8
                                             400    17          8
------------------------------------------------------------------------
SGI Indy R4600/133 SC    1            400    200    4           4
64 MB (atlas)                                400    37          4
                         2            400    200    1
                                             400    13          10
                         3            400    200    1
                                             400    13          10
------------------------------------------------------------------------
HP 715/75 64 MB          1            400    200    5           3
(jupiter)                                    400    39          3
                         2, 3         400    200    1
                                             400    12          11
------------------------------------------------------------------------
SGI Indy R5000/150 SC    1            400    200    3.5         5
64 MB (uranus)                               400    30          4
                         2            400    200    0.7
                                             400    8           16
                         3            400    200    0.7
                                             400    8           16
------------------------------------------------------------------------
HP B132 32 MB            1            400    200    2           8
(phobos)                                     400    16          8
*1                       2, 3         400    200    0.6         27
                                             400    4.5         28
------------------------------------------------------------------------
Pentium Pro P6/200       1            400    200    1           16
64 MB (f9pc00)                               400    6           21
                         2, 3         400    200    0.6         27
                                             400    4.6         28
========================================================================
*1 Compiled on ur (HP-UX 9, HP 710), run on HP-UX 10 (no f77 yet on
   phobos).

Increase in CPU time when the summation loop is made innermost:

  SGI Indy R4600/133 SC:   times 2
  SGI Indy R5000/150 SC:   times 2.6
  HP 715/75:               times 3.5


TABLE 2. As Table 1, but for program p25v.f, which calls DGEMM instead
of using DO loops. Default f77 options; precompiled system libraries
are used where available (dgemm.f is not compiled explicitly).

========================================================================
Machine                  Opt. level   ND     N      CPU (sec)   MFLOPS
------------------------------------------------------------------------
Pentium Pro P6/200       all          400    200    1.6         10 ?
64 MB (f9pc00) *1                            400    12.6        10 ?
------------------------------------------------------------------------
SGI Indy R4600/100 PC    all          400    200
32 MB (old atlas) *                          400    19
------------------------------------------------------------------------
SGI Indy R4600/133 SC    all          400    200    0.6         27
64 MB (atlas) *                              400    4.2         30
                                      400    200
                                             401    4.1         30
------------------------------------------------------------------------
SGI Indy R5000/150 SC    all          400    200    0.32        50
64 MB (uranus) *                             400    2.64        48
========================================================================
*  Non-interleaved memory.
** Compilation on R5000:
   f77 -O3 -n32 -mips4 ... /usr/lib32/mips4/libblas.a.
*1 f77 ... -lblas.
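
NOTE. The MFLOPS column can be reproduced from N and the CPU time with
the definition given for Table 1, 2 N**3 / (CPU time), scaled to
millions of operations per second. The following is only a minimal
Python sketch for checking entries; the function name mflops and the
choice of sample rows are illustrative assumptions and are not part of
p25.f or p25v.f.

```python
def mflops(n, cpu_seconds):
    """MFLOPS as defined for Table 1: 2 N**3 / (CPU time), in millions."""
    return 2 * n**3 / cpu_seconds / 1.0e6


# (label, N, CPU seconds) copied from rows of Table 1 and Table 2.
samples = [
    ("HP B132, opt. level 2, 3 (Table 1)", 400, 4.5),
    ("Pentium Pro P6/200, opt. level 2, 3 (Table 1)", 200, 0.6),
    ("SGI Indy R5000/150 SC, DGEMM (Table 2)", 400, 2.64),
]

for label, n, cpu in samples:
    print(f"{label}: {mflops(n, cpu):.1f} MFLOPS")
```

Because the CPU times are given to only one or two significant digits,
values recomputed this way can differ from the tabulated MFLOPS by a
unit or so.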