Matrix multiplication on workstations

(see also: bench_bm_mm.txt)

The first test (Table 1) uses straightforward compilation of simple do-loops; the second (Table 2) is a "best effort" version that calls DGEMM.


TABLE 1. Program: p25 (simple do-loops, vector loop innermost). MFLOPS is defined as 2 N**3 / (10**6 * CPU time), with the CPU time in seconds.
Machine                 Opt. level  ND    N     CPU (sec)   MFLOPS

HP 710 32 MB            1           400   200      8
(ur)                                      400     63
                        2, 3        400   200      2
                                          400     17           8

HP 715/75 64 MB         1           400   200      5
(jupiter)                                 400     39
                        2, 3        400   200      1
                                          400     12          11

SGI Indy R4600/133 SC   1           400   200      4
64 MB (uranus)                            400     37
                        2           400   200      1
                                          400     13
                        3           400   200      1
                                          400     13          11

SGI Indy R4600/100 PC   1           400   200
32 MB (atlas)                             400     47
                        2           400   200
                                          400     15
                        3           400   200      1
                                          400     15          10

SGI Indy R5000/150 SC   1           400   200
64 MB (new uranus)                        400
**                      2           400   200
                                          400      9.5
                        3           400   200      1
                                          400      9          14

Pentium Pro P6/200      1           400   200      1
64 MB (f9pc00)                            400      6
                        2, 3        400   200      0.6
                                          400      4.6        28

** Compiled on R4600/133 SC.
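
As a quick check of the MFLOPS column, the definition from the Table 1 caption can be evaluated directly. The sketch below is only illustrative: the function name mflops is an assumption, and the factor 10**6 that converts operations per second into MFLOPS is written out explicitly. The CPU times are the N = 400 rows at the highest optimization level for three of the machines in Table 1.

```python
def mflops(n: int, cpu_seconds: float) -> float:
    """Millions of floating-point operations per second for an n x n product.

    An n x n matrix multiplication needs n**3 multiplications and n**3 additions.
    """
    return 2 * n**3 / (cpu_seconds * 1.0e6)


# (machine, N, CPU seconds) copied from Table 1
rows = [
    ("HP 710", 400, 17.0),
    ("HP 715/75", 400, 12.0),
    ("Pentium Pro P6/200", 400, 4.6),
]

for machine, n, cpu in rows:
    print(f"{machine:20s} N={n}  CPU={cpu:5.1f} s  {mflops(n, cpu):5.1f} MFLOPS")
```

The computed values round to the tabulated 8, 11 and 28 MFLOPS.
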

Increase in CPU time when the summation loop is made the innermost loop instead:
SGI Indy R4600/133 SC   times 2
HP 715/75               times 3.5
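
The two loop orders compared above can be illustrated with a small sketch. The source of p25 is not shown here, so both routines are assumptions about its structure: the arrays are stored column by column (as in Fortran) in flat lists with declared dimension nd, and only the choice of innermost loop differs. Python will not reproduce the timing ratios; the sketch only shows which memory access pattern each ordering produces.

```python
def matmul_vector_inner(a, b, n, nd):
    """C = A*B with the row index i innermost (consecutive addresses in C and A)."""
    c = [0.0] * (nd * n)
    for j in range(n):
        for k in range(n):
            bkj = b[k + j * nd]
            for i in range(n):  # "vector" loop: stride 1 in column-major storage
                c[i + j * nd] += a[i + k * nd] * bkj
    return c


def matmul_sum_inner(a, b, n, nd):
    """C = A*B with the summation index k innermost (A is read with stride nd)."""
    c = [0.0] * (nd * n)
    for j in range(n):
        for i in range(n):
            s = 0.0
            for k in range(n):  # summation loop: a[i + k*nd] jumps nd elements per step
                s += a[i + k * nd] * b[k + j * nd]
            c[i + j * nd] = s
    return c


# Small hypothetical case: 2 x 2 matrices stored in arrays declared with nd = 3
a = [1.0, 3.0, 0.0, 2.0, 4.0, 0.0]  # A = [[1, 2], [3, 4]]
b = [5.0, 7.0, 0.0, 6.0, 8.0, 0.0]  # B = [[5, 6], [7, 8]]
print(matmul_vector_inner(a, b, 2, 3) == matmul_sum_inner(a, b, 2, 3))
```

Both orderings give the same product; on a cached machine the strided reads of A in the second variant are the likely reason for the slowdown factors quoted above.
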


TABLE 2. Program: p25v (calls DGEMM).
Machine Opt. level ND N CPU (sec) MFLOPS
SGI Indy R4600/133 SC all 400 200 0.6
64 MB (uranus) * 400 4.2 30
400 200
401 4.1 30
SGI Indy R4600/100 PC all 400 200 0.9
32 MB (atlas) * 400 6.7 19
400 200
401 6.7 19
SGI Indy R5000/150 SC all 400 200 0.32 50
64 MB (uranus) * ** 400 2.64 48
Pentium Pro P6/200 all 400 200 1.6
64 MB (f9pc00) *1 400 12.6 10
* Non-interleaved memory.
** Compiled on R4600/133 SC.
*1 Linked with -lblas.
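
For readers unfamiliar with DGEMM, the following sketch spells out what the level-3 BLAS routine computes in the non-transposed case, C := alpha*A*B + beta*C, using the standard argument order (M, N, K, ALPHA, A, LDA, B, LDB, BETA, C, LDC). It is a plain reference loop, not the optimized library code that produced Table 2. The exact call made by p25v is not shown in the source, and reading ND as the leading dimension passed as LDA, LDB and LDC is an assumption.

```python
def dgemm_nn(m, n, k, alpha, a, lda, b, ldb, beta, c, ldc):
    """Reference semantics of DGEMM with TRANSA = TRANSB = 'N'.

    Updates c in place: C := alpha*A*B + beta*C, where A is m x k, B is k x n
    and C is m x n, all stored column-major with leading dimensions lda, ldb, ldc.
    """
    for j in range(n):
        for i in range(m):
            # beta == 0 overwrites C, so its old contents are discarded
            c[i + j * ldc] = 0.0 if beta == 0.0 else beta * c[i + j * ldc]
        for l in range(k):
            t = alpha * b[l + j * ldb]
            for i in range(m):
                c[i + j * ldc] += t * a[i + l * lda]


# Hypothetical small stand-in for the benchmark case (ND = 400, N = 200)
nd, n = 3, 2
a = [1.0, 3.0, 0.0, 2.0, 4.0, 0.0]  # A = [[1, 2], [3, 4]]
b = [5.0, 7.0, 0.0, 6.0, 8.0, 0.0]  # B = [[5, 6], [7, 8]]
c = [9.0] * (nd * n)                # arbitrary initial contents
dgemm_nn(n, n, n, 1.0, a, nd, b, nd, 0.0, c, nd)
print(c)
```

Only the leading n x n block of c is written; the padding rows implied by nd > n keep their initial contents.
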