
Features:
HPC BENCHMARKS: GOING FOR GOLD IN A COMPUTER OLYMPIAD?
by Christopher Lazou
"Panta Rei" - everything is in a state of flux - and knowledge is based on the
perception of the senses. Heraclitus of Ephesus (circa 500BC).
The HPC Challenge benchmark is providing a new in-depth analysis of system
performance. To retain the competitive sporting spirit of fastest systems
lists, I propose the HPCC fraternity adopt the Olympiad medal convention. The
top system is awarded a Gold medal, the 2nd Silver and 3rd Bronze for each
benchmark group. The system with the largest number of Gold medals can be
declared the winner.
To put all this in context, the Linpack Benchmark with its single number
measure has done wonders in the last 25 years or so in highlighting the
marketing potential of new computer systems, especially in the
scientific/engineering HPC domain. The automated standard procedures
rigorously enforced by its originators, Jack Dongarra et al., the free
availability of the results, which are now distributed via a website, and the
top systems list have made Linpack one of the best known statistics in HPC
performance evaluation circles. For vendors and computer purchasers, it was
manna from heaven. Benchmarking on Linpack was simple to perform and the
single number easy to understand. Alas, reality became a victim, as no single
number can reflect the overall performance even of the simplest computer
systems, let alone complex ones.
The convergence on Linpack was not for lack of choice. In a 1991 survey
of benchmarks published as a report by the National Physical Laboratory in the
UK, 32 computer benchmarks were listed and briefly described. These included
many kernel based ones, such as the NAS kernels, GENESIS (aimed at evaluating
distributed memory MIMD machines), EUROBEN, Linpack, PERFECT and so on. The
Perfect benchmark was developed with NSF funding at the University of Illinois
(circa 1987). It was a set of 13 complete application programs consisting of
about 60,000 lines of Fortran. The initial impetus for developing the Perfect
Benchmarks was a growing dissatisfaction with the performance results obtained
from the kernel and algorithm benchmarks existing at that time.
The results from the Perfect benchmarks tended to emphasize a significant
disparity between the performance on 'real' problems and that on more
homogeneous and architectural benchmarks. They also emphasized the
'instability' of high performance computers: the extent to which the advanced
hardware features appear unable to sustain uniform performance improvement
over all aspects of an application problem. Does this sound familiar? To cut
a long story short, ports of the Perfect benchmark to SIMD and distributed
memory MPPs were rarely attempted because the work was (at the time) too
laborious. Lack of long term funding and the large effort required to keep
Perfect relevant to new hardware developments became its Achilles heel, which
eventually caused it to fall by the wayside.
Thus, although more comprehensive benchmarks were successfully implemented in
the 1980s, the effort required to run them and continuously update them to
reflect new computer hardware caused their demise, or at best consigned them
to a very specialized domain of private users. My own effort, known as the
ULCC benchmark, with full workload characterization, was too costly to run and
suffered the same fate. The reason for this outcome is simple. Most of these
benchmarks were too application specific, too narrow in design objectives, or
too parochial, with a close focus on specific hardware architectures, and were
often perceived as a measuring tool for a specific procurement. In the end the
key ingredient for general acceptance was missing: these benchmarks failed to
achieve universality, the most appealing attribute of Linpack.
Attempts to produce more relevant benchmarks continued. In the 1990s, the late
Roger Hockney further developed the GENESIS suite and Aad van der Steen
improved EUROBEN, which stresses the system under test, identifying
performance behavior across its range, but offers no magic single number for
the HPC community to latch on to. There was also the European PEPS project,
the RAPs project concentrating on weather/climate codes, Marty Guest's work
concentrating on chemistry codes and so on.
With systems becoming more complex, the need for a codified model that takes
into account the different computational elements and quantifies their
influence on performance, rather than relying on the single measure from
Linpack, was gaining urgency. The IDC started the ball rolling and the HPC
Challenge took the bait. It was sponsored by DARPA, the National Science
Foundation and the Department of Energy. They wanted something to measure the
overall effectiveness of computers and they realized that Linpack was not good
enough. HPCC started with basically 5 benchmarks, including HPL (Linpack) with
MPI on the whole system (Ax = b) as the first benchmark, but more importantly
it provided a framework (harness) for adding benchmarks of interest at a later
date. At last we have arrived at a model that can evolve over time to handle
new hardware. Apart from measuring CPU performance, the memory system and the
interconnect under various stress conditions, it allows for optimizations,
records the effort needed for tuning, and provides verification and archiving
of results.
The current HPC Challenge benchmark measures the performance of several
elements of a machine. It consists of a set of 23 measurements in eight
groups. It supplements and extends Linpack and exercises critical features of
an HPC machine: calculation speed, memory access, MPI communication and
application kernels.
By data mining these HPCC results, one can obtain invaluable insights into the
strengths of a particular system and its productivity for a particular
application domain. For example, digital signal processing has low spatial and
high temporal locality, while computational fluid dynamics has low temporal
and high spatial locality. By analysing the memory access patterns of the
various systems, one can evaluate their strengths and weaknesses with respect
to the computational elements of a particular application. The total
performance is the integral of all the results from the computational
elements, factoring in reductions caused by bottlenecks from combinatorial
interactions.
As stated above, to retain the competitive sporting spirit of the top systems
list, the top system is awarded a Gold medal, the 2nd Silver and the 3rd
Bronze for each benchmark. The system with the largest number of Gold medals
can be declared the winner. A more refined measure is to normalize on the top
performing (Gold medal) system for each benchmark so it becomes clear how big
the performance gap is between the competing systems. As an exercise, I took
the results from the HPCC website on 3rd January 2005, and below are the
"Olympiad" results I found. (Please note that the comparison is based on
results available on 3rd January, and as some of the latest systems, such as
the Cray X1E, the NEC SX-8, the IBM P5 and Blue Gene/L, had not posted any
measured results, the winners are likely to differ in the future.)
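The normalization described above can be sketched in a few lines of Python. This is only an illustration, not the procedure actually used to prepare the lists: the function name olympiad and the table layout are hypothetical, and only two of the eight groups are included, with figures copied from the medal lists in this article. The one subtlety is latency, where a smaller value is better, so the ratio to the Gold result is inverted.

```python
# Figures copied from the medal lists in this article (HPCC website,
# 3rd January 2005). The flag says whether a larger value is better;
# it is False only for latency. Names and layout are illustrative.
RESULTS = {
    "G-PTRANS (GB/s)": (True, [
        ("Cray X1 MSP", 97.408),
        ("NEC SX-6", 92.968),
        ("IBM P4+", 23.721),
    ]),
    "Random Ring Latency (microseconds)": (False, [
        ("Cray AMD Opteron", 1.63),
        ("NEC SX-7", 4.85),
        ("SGI Intel Itanium2", 5.79),
    ]),
}

MEDALS = ("Gold", "Silver", "Bronze")


def olympiad(higher_is_better, entries):
    """Rank the entries and score each as a percentage of the Gold result."""
    ranked = sorted(entries, key=lambda e: e[1], reverse=higher_is_better)
    best = ranked[0][1]
    podium = []
    for medal, (system, value) in zip(MEDALS, ranked):
        # For latency the best value is the smallest, so invert the ratio.
        ratio = value / best if higher_is_better else best / value
        podium.append((medal, system, round(100.0 * ratio, 1)))
    return podium


for name, (higher_is_better, entries) in RESULTS.items():
    print(name)
    for medal, system, score in olympiad(higher_is_better, entries):
        print(f"  {medal:6s} {system:20s} {score:5.1f}")
```

With Gold fixed at 100, the remaining scores read directly as a percentage of the leader's result, which is what makes the size of the gap visible at a glance.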
Definitions of benchmark Groups in HPC Challenge
1. G-HPL (system performance)
"Solves a randomly generated dense linear system of equations in double
floating-point precision (IEEE 64-bit) arithmetic using MPI. The linear
system matrix is stored in a two-dimensional block-cyclic fashion and
multiple variants of code are provided for computational kernels and
communication patterns. The solution method is LU factorization through
Gaussian elimination with partial row pivoting followed by a backward
substitution. Unit: Teraflops per Second"
Medal Winners
- Cray X1 MSP (252 Procs @0.8GHz) - 2.3847Tflop/s 72.8% peak; Gold (100)
- NEC SX-6 (192 Procs @ 0.5GHz) - 1.3271Tflop/s 86.4% peak; Silver (117)
- IBM P4+ (256 Procs @ 1.7GHz) - 1.0744Tflop/s 61.7% peak; Bronze (85)
[Note that when normalizing on efficiency relative to peak performance, the
NEC SX-6 would be in Gold and the Cray X1 in Silver position].
2. G-PTRANS (A=A+B^T, MPI) (system performance)
"Implements a parallel matrix transpose for two-dimensional block-cyclic
storage. It is an important benchmark because it exercises the
communications of the computer heavily on a realistic problem where pairs of
processors communicate with each other simultaneously. It is a useful test
of the total communications capacity of the network. Unit: Giga Bytes per
Second"
Medal Winners
- Cray X1 MSP (252 Procs @0.8GHz) - 97.408GB/s Gold (100)
- NEC SX-6 (192 Procs @ 0.5GHz) - 92.968GB/s Silver (95)
- IBM P4+ (256 Procs @ 1.7GHz) - 23.721GB/s Bronze (24)
3. G-Random Access (system performance)
"Global Random Access, also called GUPs, measures the rate at which the
computer can update pseudo-random locations of its memory - this rate is
expressed in billions (giga) of updates per second (GUP/s). Unit: Giga
Updates per Second"
Medal Winners
- Cray Alpha 21264 (512 Procs. @0.675GHz) - 0.028946GUP/s Gold (100)
- Cray AMD Opteron (64 Procs @2.2GHz) - 0.022397GUP/s Silver (77)
- SGI MIPS R16000 (500 Procs @0.7GHz) - 0.018297GUP/s Bronze (63)
4. EP-STREAM (per CPU)
"The Embarrassingly Parallel STREAM benchmark is a simple synthetic
benchmark program that measures sustainable memory bandwidth and the
corresponding computation rate for simple numerical vector kernels. It is
run in embarrassingly parallel manner - all computational nodes perform the
benchmark at the same time, the arithmetic average rate is reported. Unit:
Giga Bytes per Second"
Medal Winners
- NEC SX-7 (32Procs @0.552GHz) - 492.161GB/s Gold (100)
- Cray X1 MSP (64Procs @0.8GHz) - 14.990GB/s Silver (3)
- IBM P4+ (128Procs @1.7GHz) - 7.722GB/s Bronze (1.6)
[Note: The SX-7 has 32 procs in one node, i.e. no interconnect is involved.
When the SX-6 is used with 128 procs, the interconnect kicks in and the
performance changes to 27.088GB/s, Gold (100), with the Cray X1 Silver (55)
and the IBM P4+ Bronze (28.5)].
5. G-FFTE (system performance)
"Global FFTE performs the same test as FFTE but across the entire system by
distributing the input vector in block fashion across all the nodes. Unit:
Giga Flops per Second"
Medal Winners
- NEC SX-6 (128Procs @ 0.5GHz) - 37.158Gflop/s Gold (100)
- Cray AMD Opteron (64 Procs @2.2GHz) - 16.361Gflop/s Silver (44)
- Cray Alpha 21264 (512 Procs. @0.675GHz) - 15.477Gflop/s Bronze (42)
6. EP-DGEMM (per CPU)
"The Embarrassingly Parallel DGEMM benchmark measures the floating-point
execution rate of double precision real matrix-matrix multiply performed by
the DGEMM subroutine from the BLAS (Basic Linear Algebra Subprograms). It is
run in embarrassingly parallel manner - all computational nodes perform the
benchmark at the same time, the arithmetic average rate is reported. Unit:
Giga Flops per Second"
Medal Winners
- NEC SX-7 (32Procs @0.552GHz) - 140.636Gflop/s Gold (100)
- IBM P4+ (256 Procs @ 1.7GHz) - 17.979Gflop/s Silver (13)
- Dell Inc Intel Xeon EM64T (64Procs @3.4GHz) - 6.081Gflop/s Bronze (4)
7. Random Ring Bandwidth (per CPU)
"Randomly Ordered Ring Bandwidth, reports bandwidth achieved in the ring
communication pattern. The communicating nodes are ordered randomly in the
ring (with respect to the natural ordering of the MPI default communicator).
The result is averaged over various random assignments of processes in the
ring. Unit: Giga Bytes per second per CPU"
Medal Winners
- NEC SX-7 (32Procs @0.552GHz) - 8.14753GB/s Gold (100)
- Cray X1 MSP (60Procs @0.8GHz) - 1.03291GB/s Silver (12.7)
- IBM P4+ (64 Procs @ 1.7GHz) - 0.74828GB/s Bronze (9.2)
8. Random Ring Latency (per CPU)
"Randomly-Ordered Ring Latency, reports latency in the ring communication
pattern. The communicating nodes are ordered randomly in the ring (with
respect to the natural ordering of the MPI default communicator) in the
ring. The result is averaged over various random assignments of processes in
the ring. Unit: micro-seconds"
Medal Winners
- Cray AMD Opteron (64 Procs @2.2GHz) - 1.63 microseconds Gold (100)
- NEC SX-7 (32Procs @0.552GHz) - 4.85 microseconds Silver (33.6)
- SGI Intel Itanium2 (32Procs @1.3GHz) - 5.79 microseconds Bronze (28.2)
The final list of winners and medalists from the above analysis is:
- The NEC SX series systems, 4 Gold and 3 Silver medals,
- The Cray X1 MSP system, 2 Gold and 2 Silver medals,
- The Cray AMD Opteron system (Cray XD1), 1 Gold and 2 Silver medals,
- The Cray Alpha system, 1 Gold and 1 Bronze medal,
- The IBM Power P4+ system, 1 Silver and 4 Bronze medals,
- The SGI systems (MIPS R16000 and Itanium2), 2 Bronze medals and,
- The Dell Inc. Intel Xeon, 1 Bronze medal.
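The tally above can be reproduced mechanically from the eight podiums. The sketch below is a minimal illustration under two stated assumptions: the family labels (for example "NEC SX" covering both the SX-6 and the SX-7, and "SGI" covering both the MIPS and Itanium2 machines) are my grouping of the listed systems, and the helper name medal_table is hypothetical. Ordering is by Gold first, then Silver, then Bronze, in keeping with the Olympiad convention.

```python
# Podiums (Gold, Silver, Bronze) for the eight benchmark groups listed above.
# The family labels are an assumed grouping of the individual systems.
PODIUMS = {
    "G-HPL": ("NEC SX", "IBM P4+", "Cray X1")[::-1][:0] or ("Cray X1", "NEC SX", "IBM P4+"),
    "G-PTRANS": ("Cray X1", "NEC SX", "IBM P4+"),
    "G-Random Access": ("Cray Alpha", "Cray Opteron", "SGI"),
    "EP-STREAM": ("NEC SX", "Cray X1", "IBM P4+"),
    "G-FFTE": ("NEC SX", "Cray Opteron", "Cray Alpha"),
    "EP-DGEMM": ("NEC SX", "IBM P4+", "Dell Xeon"),
    "Random Ring Bandwidth": ("NEC SX", "Cray X1", "IBM P4+"),
    "Random Ring Latency": ("Cray Opteron", "NEC SX", "SGI"),
}


def medal_table(podiums):
    """Count medals per system family and sort Olympic style."""
    tally = {}
    for podium in podiums.values():
        for position, system in enumerate(podium):
            # position 0 is Gold, 1 is Silver, 2 is Bronze
            tally.setdefault(system, [0, 0, 0])[position] += 1
    # Lists compare element by element, so this sorts by Gold count,
    # then Silver, then Bronze.
    return sorted(tally.items(), key=lambda item: item[1], reverse=True)


for system, (gold, silver, bronze) in medal_table(PODIUMS):
    print(f"{system:14s} Gold {gold}  Silver {silver}  Bronze {bronze}")
```

Sorting on the whole medal list rather than on Gold alone breaks ties between systems with the same number of Gold medals, which is why the Cray Opteron system ranks ahead of the Cray Alpha system.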
The HPC Challenge benchmark clearly shows the strength of parallel vector
architectures and highly integrated systems with high bandwidth, low latency
memory subsystems. The difference in performance between a vector and a scalar
system can be up to a factor of 60.
As for winners of future computer systems Olympiad competitions, "Panta Rei" -
everything is in a state of flux.
(The definitions of benchmark Groups in HPC Challenge are from their website
hence the use of quotes. Brands and names are the property of their respective
owners).
Copyright: Christopher Lazou, HiPerCom Consultants, Ltd., UK. January 2005