HPCwire
 The global publication of record for High Performance Computing / January 21, 2005: Vol. 14, No. 3


Features:

HPC BENCHMARKS: GOING FOR GOLD IN A COMPUTER OLYMPIAD?
by Christopher Lazou

"Panta Rei" - everything is in a state of flux - and knowledge is based on the perception of the senses. Heraclitus of Ephesus (circa 500BC).


The HPC Challenge benchmark is providing a new in-depth analysis of system performance. To retain the competitive sporting spirit of the fastest-systems lists, I propose that the HPCC fraternity adopt the Olympiad medal convention. For each benchmark group, the top system is awarded a Gold medal, the second a Silver and the third a Bronze. The system with the largest number of Gold medals can be declared the winner.

To put all this in context, the Linpack benchmark with its single-number measure has done wonders over the last 25 years or so in highlighting the marketing potential of new computer systems, especially in the scientific/engineering HPC domain. The automated standard procedures rigorously enforced by its originators, Jack Dongarra et al, the free availability of the results, now distributed via a website, and the top systems list have made Linpack one of the best known statistics in HPC performance evaluation circles. For vendors and computer purchasers it was manna from heaven. Benchmarking with Linpack was simple to perform and the single number easy to understand. Alas, reality became a victim, as no single number can reflect the overall performance of even the simplest computer systems, let alone complex ones.

The convergence to using Linpack was not from lack of choice. In a 1991 survey of benchmarks published as a report by the National Physical Laboratory in the UK, 32 computer benchmarks were listed and briefly described. These included many kernel-based ones, such as the NAS kernels, GENESIS (aimed at evaluating distributed memory MIMD machines), EUROBEN, Linpack, PERFECT and so on. The Perfect benchmark was developed with NSF funding at the University of Illinois (circa 1987). It was a set of 13 complete application programs consisting of about 60,000 lines of Fortran. The initial impetus for developing the Perfect Benchmarks was a growing dissatisfaction with the performance results obtained from the kernel and algorithm benchmarks existing at that time.

The results from the Perfect benchmarks tended to emphasize a significant disparity between the performance on 'real' problems and that on more homogeneous, architecture-oriented benchmarks. They also emphasized the 'instability' of high performance computers: the extent to which advanced hardware features appear unable to sustain uniform performance improvement over all aspects of an application problem. Does this sound familiar? To cut the story short, efforts to port the Perfect benchmark to SIMD and distributed memory MPPs were rarely attempted because the effort was (at the time) too laborious. Lack of long-term funding and the large effort required to keep Perfect relevant to new hardware developments became its Achilles heel, which eventually caused it to fall by the wayside.

Thus, although more comprehensive benchmarks were successfully implemented in the 1980s, the effort required to run them and to continuously update them to reflect new computer hardware caused their demise, or at best consigned them to a very specialized domain of private users. My own effort, known as the ULCC benchmark, with full workload characterization, was too costly to run and suffered the same fate. The reason for this outcome is simple. Most of these benchmarks were either too application specific, too narrow in design objectives, or too parochial, with a close focus on specific hardware architectures, and were often perceived as a measuring tool for a specific procurement. In the end the key ingredient for general acceptance was missing; these benchmarks failed to achieve universality, the most appealing attribute of Linpack.

Attempts to produce more relevant benchmarks continued. In the 1990s the late Roger Hockney further developed the GENESIS suite, and Aad van der Steen improved EUROBEN, which stresses the system under test and identifies performance behavior across its range, but offers no magic single number for the HPC community to latch on to. There were also the European PEPS project, the RAPS project concentrating on weather/climate codes, Marty Guest's benchmarks concentrating on chemistry codes, and so on.

With systems becoming more complex, the need for a codified model that takes into account the different computational elements and quantifies their influence on performance, rather than relying on the single measure from Linpack, was gaining urgency. IDC started the ball rolling and the HPC Challenge took the bait. It was sponsored by DARPA, the National Science Foundation and the Department of Energy, who wanted something to measure the overall effectiveness of computers and realized that Linpack was not good enough.

HPCC started with basically five benchmarks, with HPL (Linpack) run with MPI across the whole system (Ax = b) as the first, but more importantly it provided a framework (harness) for adding benchmarks of interest at a later date. At last we have arrived at a model which can evolve over time to handle new hardware. Apart from measuring CPU performance, the memory system and the interconnect under various stress conditions, it allows for optimizations, records the effort needed for tuning, and provides verification and archiving of results.

The current HPC Challenge benchmark measures the performance of several elements of a machine. It consists of a set of 23 measurements in eight groups. It supplements and extends Linpack, exercising critical features of an HPC machine: calculation speed, memory access, MPI communication and application kernels.

By data mining these HPCC results one can obtain invaluable insights into the strengths of a particular system and its productivity for a particular application domain. For example, digital signal processing has low spatial and high temporal locality, while computational fluid dynamics has low temporal and high spatial locality. By analysing the memory access patterns of the various systems one can evaluate their strengths and weaknesses with respect to the computational elements of a particular application. The total performance is the integral of all the results from the computational elements, factoring in reductions caused by bottlenecks from combinatorial interactions.

As stated above, to retain the competitive sporting spirit of the top systems list, the top system is awarded a Gold medal, the second a Silver and the third a Bronze for each benchmark. The system with the largest number of Gold medals can be declared the winner. A more refined measure is to normalize on the top performing (Gold medal) system for each benchmark, so that it becomes clear how big the performance gap is between the competing systems. As an exercise, I took the results from the HPCC website on 3rd January 2005, and below are the "Olympiad" results I found. (Please note that the comparison is based on results available on 3rd January; as some of the latest systems, such as the Cray X1E, the NEC SX-8, the IBM P5 and Blue Gene/L, have not yet posted any measured results, the winners are likely to differ in the future.)
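
To make the scoring concrete, the short Python sketch below implements the normalization just described: rank the systems on one benchmark, award Gold, Silver and Bronze, and express every result as a percentage of the Gold result. The system names and figures in the example are placeholders rather than measured HPCC numbers.

    # Minimal sketch of the "Olympiad" scoring described above: the best result on a
    # benchmark takes Gold and every entry is normalized to that best (Gold = 100).
    def award_medals(results):
        """results: dict mapping system name -> measured value (higher is better)."""
        ranked = sorted(results.items(), key=lambda kv: kv[1], reverse=True)
        best = ranked[0][1]
        medals = ["Gold", "Silver", "Bronze"]
        return [(medals[i], system, value, round(100.0 * value / best, 1))
                for i, (system, value) in enumerate(ranked[:3])]

    # Hypothetical example: three systems on a bandwidth-style benchmark (GB/s).
    example = {"System A": 97.4, "System B": 93.0, "System C": 23.7}
    for medal, system, value, normalized in award_medals(example):
        print(f"{medal:6s} {system:10s} {value:7.1f} GB/s  ({normalized})")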

Definitions of benchmark Groups in HPC Challenge

1. G-HPL (system performance)

"Solves a randomly generated dense linear system of equations in double floating-point precision (IEEE 64-bit) arithmetic using MPI. The linear system matrix is stored in a two-dimensional block-cyclic fashion and multiple variants of code are provided for computational kernels and communication patterns. The solution method is LU factorization through Gaussian elimination with partial row pivoting followed by a backward substitution. Unit: Teraflops per Second"

Medal Winners

  1. Cray X1 MSP (252 Procs @0.8GHz) - 2.3847Tflop/s 72.8% peak; Gold (100)
  2. NEC SX-6 (192 Procs @ 0.5GHz) - 1.3271Tflop/s 86.4% peak; Silver (117)
  3. IBM P4+ (256 Procs @ 1.7GHz) - 1.0744Tflop/s 61.7% peak; Bronze (85)

[Note that when normalizing on efficiency relative to peak performance, the NEC SX-6 would be in the Gold and the Cray X1 in the Silver position.]

2. G-PTRANS (A=A+B^T, MPI) (system performance)

"Implements a parallel matrix transpose for two-dimensional block-cyclic storage. It is an important benchmark because it exercises the communications of the computer heavily on a realistic problem where pairs of processors communicate with each other simultaneously. It is a useful test of the total communications capacity of the network. Unit: Giga Bytes per Second"

Medal Winners

  1. Cray X1 MSP (252 Procs @0.8GHz) - 97.408GB/s Gold (100)
  2. NEC SX-6 (192 Procs @ 0.5GHz) - 92.968GB/s Silver (95)
  3. IBM P4+ (256 Procs @ 1.7GHz) - 23.721GB/s Bronze (24)

3. G-Random Access (system performance)

"Global Random Access, also called GUPs, measures the rate at which the computer can update pseudo-random locations of its memory - this rate is expressed in billions (giga) of updates per second (GUP/s). Unit: Giga Updates per Second"

Medal Winners

  1. Cray Alpha 21264 (512 Procs. @0.675GHz) - 0.028946GUP/s Gold (100)
  2. Cray AMD Opteron (64 Procs @2.2GHz) - 0.022397GUP/s Silver (77)
  3. SGI MIPS R16000 (500 Procs @0.7GHz) - 0.018297GUP/s Bronze (63)

4. EP-STREAM (per CPU)

"The Embarrassingly Parallel STREAM benchmark is a simple synthetic benchmark program that measures sustainable memory bandwidth and the corresponding computation rate for simple numerical vector kernels. It is run in embarrassingly parallel manner - all computational nodes perform the benchmark at the same time, the arithmetic average rate is reported. Unit: Giga Bytes per Second"

Medal Winners

  1. NEC SX-7 (32Procs @0.552GHz) - 492.161GB/s Gold (100)
  2. Cray X1 MSP (64Procs @0.8GHz) - 14.990GB/s Silver (3)
  3. IBM P4+ (128Procs @1.7GHz) - 7.722GB/s Bronze (1.6)

[Note: The SX-7 has 32 procs in one node, i.e. no interconnect is involved. When the SX-6 is used with 128 procs the interconnect kicks in and the performance changes to 27.088GB/s, Gold (100), with the Cray X1 Silver (55) and IBM P4+ Bronze (28.5).]

5. G-FFTE (system performance)

"Global FFTE performs the same test as FFTE but across the entire system by distributing the input vector in block fashion across all the nodes. Unit: Giga Flops per Second"

Medal Winners

  1. NEC SX-6 (128Procs @ 0.5GHz) - 37.158Gflop/s Gold (100)
  2. Cray AMD Opteron (64 Procs @2.2GHz) - 16.361Gflop/s Silver (44)
  3. Cray Alpha 21264 (512 Procs. @0.675GHz) - 15.477Gflop/s Bronze (42)

6. EP-DGEMM (per CPU)

"The Embarrassingly Parallel DGEMM benchmark measures the floating-point execution rate of double precision real matrix-matrix multiply performed by the DGEMM subroutine from the BLAS (Basic Linear Algebra Subprograms). It is run in embarrassingly parallel manner - all computational nodes perform the benchmark at the same time, the arithmetic average rate is reported. Unit: Giga Flops per Second"

Medal Winners

  1. NEC SX-7 (32Procs @0.552GHz) - 140.636Gflop/s Gold (100)
  2. IBM P4+ (256 Procs @ 1.7GHz) - 17.979Gflop/s Silver (13)
  3. Dell Inc Intel Xeon EM64T (64Procs @3.4GHz) - 6.081Gflop/s Bronze (4)

7. Random Ring Bandwidth (per CPU)

"Randomly Ordered Ring Bandwidth, reports bandwidth achieved in the ring communication pattern. The communicating nodes are ordered randomly in the ring (with respect to the natural ordering of the MPI default communicator). The result is averaged over various random assignments of processes in the ring. Unit: Giga Bytes per second per CPU"

Medal Winners

  1. NEC SX-7 (32Procs @0.552GHz) - 8.14753GB/s Gold (100)
  2. Cray X1 MSP (60Procs @0.8GHz) - 1.03291GB/s Silver (12.7)
  3. IBM P4+ (64 Procs @ 1.7GHz) - 0.74828GB/s Bronze (9.2)

8. Random Ring Latency (per CPU)

"Randomly-Ordered Ring Latency, reports latency in the ring communication pattern. The communicating nodes are ordered randomly in the ring (with respect to the natural ordering of the MPI default communicator) in the ring. The result is averaged over various random assignments of processes in the ring. Unit: micro-seconds"

Medal Winners

  1. Cray AMD Opteron (64 Procs @2.2GHz) - 1.63 microseconds Gold (100)
  2. NEC SX-7 (32Procs @0.552GHz) - 4.85 microseconds Silver (33.6)
  3. SGI Intel Itanium2 (32Procs @1.3GHz) - 5.79 microseconds Bronze (28.2)

The final list of winners and medalists from the above analysis is:

  1. The NEC SX series systems, 4 Gold and 3 Silver medals,
  2. The Cray X1 MSP system, 2 Gold and 2 Silver medals,
  3. The Cray AMD Opteron system (Cray XD1), 1 Gold and 2 Silver medals,
  4. The Cray Alpha system, 1 Gold and 1 Bronze medal,
  5. The IBM Power P4+ system, 1 Silver and 4 Bronze medals,
  6. The SGI systems, 1 Bronze each for the MIPS R16000 and the Itanium2, and
  7. The Dell Inc. Intel Xeon system, 1 Bronze.

The HPC Challenge benchmark clearly shows the strength of parallel vector architectures and highly integrated systems with high bandwidth, low latency memory subsystems. The difference in performance between a vector and a scalar system can be up to a factor of 60.

As for winners of future computer systems Olympiad competitions, "Panta Rei" - everything is in a state of flux.

(The definitions of the benchmark groups in HPC Challenge are taken from the HPCC website, hence the use of quotes. Brands and names are the property of their respective owners.)

Copyright: Christopher Lazou, HiPerCom Consultants, Ltd., UK. January 2005

