HPCwire
 The global publication of record for High Performance Computing / June 25, 2004: Vol. 13, No. 25


Features:

TOP500 AND HPCC BENCHMARKS -- WHAT THEY CAN AND CAN'T DO
Commentary from the High-End Crusader

A recent HPCwire article mused: would a new Chinese cluster computer with 2,000 AMD Opteron processors and a reported 10 TFs/s of peak performance make it into the top 15 of the Top500 list? (The news hook here is the system's alleged potential for nuclear-weapons design.) After the requisite mention of the Japanese Earth Simulator, which currently sits atop the list with 40 TFs/s of peak performance, attention turned to other contenders. Would Cray's Red Storm computer at Sandia make the November list? How would new clusters at Livermore and Los Alamos do? Would the Cray leadership-class system at Oak Ridge reach 100 and then 250 TFs/s? And would IBM's Blue Gene/L cluster at Livermore reach 360 TFs/s?

All this is slightly curious. What exactly do such numbers mean? And why are Linpack R_max ratings, which lie at the heart of the Top500 list, always such sizeable fractions of system peak flops (88% for the Earth Simulator)?

There is no reason whatsoever to think that parallel architectures as diverse as those mentioned in the first paragraph can usefully be measured by the _single_ Linpack R_max benchmark. The typical (conventional) U.S. high-end computer is a large-scale cluster of scalar SMP nodes. The Earth Simulator is a cluster of vector SMP nodes. Both are shared memory within nodes and message-passing distributed memory across nodes. In contrast, Red Storm is a Linux multicomputer; its nodes are individual Opteron processors and local memory. Red Storm is pure message-passing distributed memory across nodes.

Cray's X1 supercomputer, which---along with Red Storm---is an integral part of the leadership-class system at Oak Ridge, is a distributed shared-memory multiprocessor with vector SMP nodes. The Cray X1, and later the Cray X2, will be components of the National Leadership Computing Facility (NLCF) at Oak Ridge; both are shared memory within nodes and distributed shared memory across nodes. Obviously, these widely diverse machines are appropriate for computing widely diverse application/algorithm classes: they are good at computing very different things!

Historically, high-end computing has painted itself into an awkward corner by focusing inordinate attention on the relative performance of different systems on the Linpack R_max benchmark, which is used to rank machines on the Top500 list. Linpack is a benchmark from dense linear algebra. It has such a high degree of producer-consumer temporal locality that it represents (is typical of) only those applications that require significant floating-point performance but need little in the way of memory access, communication, and synchronization. If your critical application requires almost none of _any_ of these things, then perhaps _only_ the Linpack R_max performance number is relevant to you.

Any specialist in performance modeling of high-end computers will tell you that the field of high-end computing would benefit enormously from one or more community-standard benchmarks to complement the scalable Linpack benchmark that has been used for many years. One advantage of Linpack is that it can be arbitrarily scaled to match the increased capabilities of newer systems. But it emphasizes performance on regular, dense-matrix operations that possess strong data locality, and ignores the realm of irregular, sparse-data access typical of the majority of modern scientific-computing applications.

There is an undeniable political component to Linpack's longevity as the primary execution metric. It can be expressed in one word: machoflops. After providing for Iraq and homeland security, U.S. politicians are only willing to spend money to procure "fast" computers. Unfortunately, _peak_ machoflops (even R_max machoflops) are a very poor indicator of (actual) performance on many important scientific applications.

Indeed, the Linpack R_max performance number, which is an _upper bound_ on performance when there is _exceptional_ temporal locality, offers no insights to the application developer into how the algorithmic characteristics of his application will affect execution time, and no insights to the system designer into which architectural characteristics of a computer system are important.

The smart money is now rallying in support of the HPC Challenge benchmark suite, co-sponsored by DARPA's High-Productivity Computing Systems (HPCS) program, DOE Office of Science, and NSF, and introduced by Jack Dongarra and co-workers at SC2003. Linpack stays, but there are other benchmarks in the suite that stress various hardware aspects of a system that govern the rate at which memory operands can be supplied to processors.

Important user organizations have called for widespread adoption of the HPCC benchmark suite. For example, Thomas Zacharia, ORNL's associate director for computing and computational sciences, and point man for the NLCF, said, "The HPC Challenge benchmark suite is an important advance that will help us better understand which applications perform best on which HPC systems. The results available today are already useful. We look forward to widespread adoption of this new measurement tool by the global HPC community".

We will explore both what the HPCC benchmarks measure and how they might be combined to predict application performance.

  • Benchmarks Must Be Combined With Amdahl's Law

Is there a scientific theory of benchmarks? For example, is it possible in principle that a suite of benchmarks could be used to accurately predict the performance of a future system, much larger than systems currently in use, on a scientific application much larger than any currently being run? This is a tough question. The answer is, "probably not", unless somehow those benchmark results could be integrated into a larger performance-modeling framework that would allow their proper interpretation. Let us see why this is so.

One of the standard golden rules in any architect's toolkit is Amdahl's law. A typical (made-up) problem reads: Measurements show that, when program P is run on machine X, floating-point operations account for 60% of the run time, without affecting the operations that account for the other 40%. If the execution of floating-point operations is sped up by a factor of 2, what is the total speed-up of program P? (Answer: 1.43). The implicit assumption here is that the floating-point operations represent a certain quantity of work that is performed, independently of the other work, at a certain computational rate, thus making a certain contribution to the total run time. If that computational rate is increased, and if the two types of operations do not overlap in time, then the contribution of floating-point operations to the new total run time is proportionally reduced and can be calculated exactly. Can this logically sound framework be carried over to benchmarks?
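The arithmetic is worth checking; a few lines of Python suffice (an illustration of Amdahl's law, not code from any benchmark suite):

```python
def amdahl_speedup(fraction, factor):
    """Overall speedup when `fraction` of the run time is sped up by `factor`,
    assuming the affected and unaffected work do not overlap in time."""
    return 1.0 / ((1.0 - fraction) + fraction / factor)

# 60% of the run time is floating-point work, sped up by a factor of 2:
print(round(amdahl_speedup(0.60, 2.0), 2))  # 1.43
```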

Let's try. Suppose, by some means, a customer has determined that 60% of the run time of his critical application is accounted for by computational work that is identifiably similar in its algorithmic characteristics to the work in benchmark A, while the remaining 40% is accounted for by work that is identifiably similar to the work in benchmark B, always assuming that the times to perform these two quantities of work do not overlap. Under these assumptions, the customer can indeed predict the relative performance of his critical application on two different machines by running both benchmarks on both machines and comparing the weighted sum. But how likely is it that the customer will know such things about his applications or that benchmarks that match specific sets of algorithmic characteristics will be available?

To estimate the execution time of a single application running on a single system, the calculation is:

    T = W_1/R_1 + W_2/R_2 + ...

where W_i is the quantity of computational work of kind i, and R_i is the computational rate for that kind of work as measured by benchmark i, with appropriate fudging to deal with overlapping of different kinds of computational work.
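A sketch of that bookkeeping in Python (the phase decomposition, work quantities, and rates below are hypothetical):

```python
def predicted_time(phases):
    """Sum over non-overlapping phases of (quantity of work) / (rate measured
    by the matching benchmark). Overlapping phases would need the 'fudging'
    mentioned above."""
    return sum(work / rate for work, rate in phases)

# Hypothetical application: 6e12 flops of dense Linpack-like work at 3e12 flop/s,
# plus 2e9 random updates at 1e8 updates/s.
print(predicted_time([(6e12, 3e12), (2e9, 1e8)]))  # 22.0 seconds
```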

One major problem is, if a customer is to be able to decompose an arbitrary application, saying that part of it is like one benchmark and part of it is like some other benchmark, and so on, how many benchmarks do we need to have available altogether? Since this question is impossible to answer in the abstract, we need to shift gears and start from a theory of performance. After this is straightened out and we know roughly how to predict application performance from benchmark results, we will say a few words about programming time, which is the other major component of time to solution.

  • A Theory Of Performance

Conventional high-end scientific computers are constructed by connecting a set of _nodes_, each containing multiple processors and memory, using a system _interconnection network_. If some computation can be localized to take place largely within individual nodes, then application performance is largely determined by the architecture of individual nodes. In contrast, if the computation requires significant global communication across nodes, then application performance is determined by both the architecture of individual nodes and the architecture of the interconnection network.

Consider computation within a node. Assume temporarily that all memory references must be satisfied from local memory: there is no help from the cache. We restate Little's law as follows: 1) arithmetic performance of one flop per cycle requires delivery to the processor of one memory-operand word per cycle (this is the operand bandwidth); 2) the operand bandwidth is the _minimum_ of the hardware bandwidth to local memory and the rate of issuing memory requests (this is the application bandwidth); and 3) the application bandwidth is equal to the number of outstanding memory-references, i.e., the memory-reference concurrency, divided by the memory latency.

With a strongly parallel processor, one can almost always succeed in raising the application bandwidth _above_ the hardware bandwidth, making hardware bandwidth the critical parameter limiting performance. The same cannot be said for (weakly parallel) commodity processors, which are limited in the number of outstanding memory references they can support. A typical commodity processor supports only 8 outstanding memory accesses. Given typical values of the other parameters in a COTS-based system, application bandwidth becomes the critical parameter limiting performance. (There is a long discussion of strongly and weakly parallel systems in HPCwire article [107765]).

Here are some numbers loosely modeled on the Intel Pentium 4 XeonMP processor. The hardware bandwidth to local memory is 0.25 words/cycle. The latency to local memory is 200 cycles. Assuming at most 8 outstanding memory references, the maximum application bandwidth, for _any_ application with single-word random memory accessing, is 0.04 words/cycle, which is 6.25 times _less_ than the hardware bandwidth.
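These numbers plug directly into the restated Little's law; a quick Python sketch (parameter values are the loose XeonMP model above, not measurements):

```python
def application_bandwidth(outstanding_refs, latency_cycles):
    """Little's law: sustainable memory-request rate = concurrency / latency."""
    return outstanding_refs / latency_cycles

hw_bw = 0.25                              # words/cycle of hardware bandwidth
app_bw = application_bandwidth(8, 200)    # 8 outstanding refs, 200-cycle latency
print(app_bw)                             # 0.04 words/cycle
print(hw_bw / app_bw)                     # 6.25x below the hardware bandwidth
print(min(hw_bw, app_bw))                 # operand bandwidth: the smaller of the two
```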

Note that spatial data locality is not sufficient to hide latency---you need either data-reuse or producer-consumer temporal locality. For example, you can make the Stream sum-kernel loop A(i) = B(i) + C(i) completely spatially local, but it executes at only a fraction of peak performance because there is a cache miss every L iterations, where L is the length of the cache line, say, 16. Prefetching can cover L = 16 cycles of the latency, but 16 is much less than 200. With neither data-reuse nor producer-consumer temporal locality, a commodity processor idles most of the time. (Your correspondent may be somewhat generous in suggesting that commodity processors exploit all forms of producer-consumer temporal locality as effectively as possible).

So far, all limits on performance have been derived from parameters characterizing the individual-node architecture without considering particular applications. Let us now consider the effects of the D-cache. When data reuse is sufficiently high that most memory references can be satisfied from the cache, we get a cache multiplier effect that makes the _effective_ operand bandwidth more than the _actual_ operand bandwidth. For example, if the cache hit rate is 99%, then the arithmetic performance is 100 times the actual operand bandwidth, which is still the minimum of the hardware bandwidth and the application bandwidth (which, for our commodity processor, cannot exceed 0.04 words/cycle, for any application with random accessing).

  • Information From Local Microbenchmarks

Choosing a set of execution metrics, and the microbenchmarks to measure them, is making a decision about what are the important performance aspects of our codes. Obviously, applications vary in their patterns of memory access, communication, and synchronization. But which attributes should we measure? We will see that (most) benchmarks do not measure application-independent properties of architectures, although we can read about such things in hardware reference manuals. Rather, benchmark results mix together properties of the application, i.e., the benchmark, and properties of the architecture. We need a broad range of benchmarks to indirectly measure all the significant hardware mechanisms in an architecture that can lead to high performance.

The first supplementary execution metric adopted by HPCC is local unit-stride data-transfer rate, as measured by the Stream triad benchmark using all the processors and at least half the total memory of the system. The Stream triad-kernel loop body is A(i) = B(i) + scalar * C(i). This microbenchmark measures actual operand bandwidth with the Stream triad code. There is no temporal locality but plenty of spatial locality.
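A minimal serial sketch of the triad kernel (the real benchmark is compiled code run on all processors over at least half of memory; interpreter overhead means this sketch illustrates the access pattern and the bandwidth accounting, not achievable rates):

```python
import time
from array import array
import random

n = 200_000                          # illustrative size only
b = array('d', (random.random() for _ in range(n)))
c = array('d', (random.random() for _ in range(n)))
scalar = 3.0

t0 = time.perf_counter()
# Triad loop body: A(i) = B(i) + scalar * C(i); unit stride, no data reuse.
a = array('d', (bi + scalar * ci for bi, ci in zip(b, c)))
elapsed = time.perf_counter() - t0

# Three arrays of 8-byte words move: read B, read C, write A.
bytes_moved = 3 * n * 8
print(f"approx {bytes_moved / elapsed / 1e9:.3f} GB/s (interpreter-bound)")
```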

For weakly parallel processors, this benchmark measures how well the node's cache system works. Since the only latency tolerance comes from caches used spatially, we can estimate the concurrency, say, in words, as the number of outstanding cache misses times the number of words per miss. Unit stride makes this benchmark cache friendly: the processor can profit from all the outstanding words. Nevertheless, for weakly parallel processors, in all probability, the application (and hence operand) bandwidth will be somewhat to significantly less than the hardware bandwidth to local memory.

For strongly parallel processors, hardware parallelism mechanisms generate plenty of memory-reference concurrency independently of the cache, so in all probability the operand bandwidth will be essentially equal to the hardware bandwidth to local memory, in contrast to weakly parallel processors. In both cases, this cache-friendly benchmark measures the minimum of hardware bandwidth and application bandwidth for this specific form of local unit-stride memory accessing.

If we want an empirical determination of whether the hardware bandwidth is larger or smaller than the application bandwidth while executing the Stream triad benchmark on a given machine, then we need to run a supplementary benchmark (in some form). By itself, the Stream triad benchmark doesn't tell us the answer to this question. There are contexts in which this question is important. The simplest determination is to vary the memory-reference concurrency and observe its effect on performance: if varying the concurrency changes the performance, then application bandwidth (i.e., processor parallelism) is the bottleneck; if not, hardware bandwidth is.


Unit-stride memory accessing is cache friendly: prefetched words are in fact used by the processor. In contrast, random single-word memory references, such as gathers and scatters in sparse-matrix calculations, make the bandwidth problem (not enough bandwidth!) worse due to the granularity of cache lines. Since random memory accessing has no spatial and no temporal locality, it causes the effective operand bandwidth to be less than the actual operand bandwidth by a factor equal to the cache-line length L, again, say, 16.

The second supplementary execution metric adopted by HPCC is local random data-transfer rate, as measured by Gups run concurrently on as many equal-sized tables as the system has processors, with the memory arrays in the aggregate consuming at least half the total memory of the system. The Gups loop body makes random updates to randomly chosen memory locations. This benchmark has no spatial and no temporal locality.
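A serial sketch of the update loop (table size and update count are illustrative; the actual HPCC RandomAccess benchmark drives XOR updates from a pseudo-random sequence, but the point here is only the access pattern):

```python
import random

table_size = 1 << 20                    # illustrative; HPCC uses >= half of memory
table = list(range(table_size))
n_updates = 1 << 16

rng = random.Random(1)                  # fixed seed for reproducibility
for _ in range(n_updates):
    idx = rng.randrange(table_size)     # random location: no spatial locality
    table[idx] ^= rng.getrandbits(32)   # random update: no temporal locality

# Each update touches one useful word on what is almost surely a fresh cache
# line, so the effective concurrency is divided by the line length L.
print(len(table))
```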

For weakly parallel processors, this benchmark measures the operand bandwidth when the cache is severely stressed. As before, the only latency tolerance comes from caches used spatially, but now the concurrency is the number of outstanding cache misses times the _one_ word that is useful to the processor. If you will, the _effective_ concurrency has been divided by a factor of L, thereby reducing the _effective_ application bandwidth by the same factor. This benchmark is cache unfriendly. Certainly, the application bandwidth will be significantly less than the hardware bandwidth to local memory.

For strongly parallel processors, again, hardware parallelism mechanisms generate plenty of memory-reference concurrency independently of the cache, so in all probability the operand bandwidth will be essentially equal to the hardware bandwidth to local memory, in sharp contrast to weakly parallel processors. The hardware for unit-stride accessing is slightly more efficient than the hardware for irregular-stride accessing, so we expect a small fall-off from the unit-stride case. Nevertheless, the fall-off from unit stride to irregular stride will be far less for a strongly parallel processor than for a weakly parallel processor. In both cases, this cache-unfriendly benchmark measures the minimum of hardware bandwidth and application bandwidth for this specific form of local irregular-stride memory accessing.

It may be useful to capture our theory of performance in a simple equation. Your correspondent likes:

                                  c(a,s)
    p(a,s) = l(a,s) * min { b(s), ------ }
                                  t(a,s)

That is, the arithmetic performance p(a,s) of application 'a' on system 's' is given by the product of a locality factor (cache multiplier) l(a, s), which may be either greater than or smaller than 1, and the minimum of the hardware bandwidth b(s) and the application bandwidth (no symbol), which is equal to the memory-reference concurrency c(a,s) divided by the memory latency t(a,s). Most of these quantities are application-, system-, and hence time-dependent, since an application can vary its algorithmic characteristics as it proceeds. (Obviously, average values over given temporal intervals exist).
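As a sanity check, the formula is easy to evaluate for the commodity-node numbers used earlier (a sketch; the 100x locality factor corresponds to the 99% hit-rate example above):

```python
def performance(locality, hw_bw, concurrency, latency):
    """p(a,s) = l(a,s) * min( b(s), c(a,s)/t(a,s) ), in words (hence flops)
    per cycle. `locality` is the cache multiplier l(a,s)."""
    return locality * min(hw_bw, concurrency / latency)

# Commodity node from earlier: b(s) = 0.25 words/cycle, c = 8, t = 200 cycles.
print(performance(1.0, 0.25, 8, 200))    # 0.04 -- application-bandwidth limited
print(performance(100.0, 0.25, 8, 200))  # the 99%-hit-rate cache multiplier at work
```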

Our goal is to accurately predict the performance of a full-scale application program using a compact formula based on easily measurable data. This may be possible if we model applications as going through (possibly overlapping) performance phases. A _performance phase_ is a computational interval during which some quantity of the _same kind_ of computational work is performed at a fixed computational rate. Both of the local microbenchmarks mentioned have this computational uniformity and measure fixed computational rates (constant values of p(a, s)). If the full-scale application is appropriately modeled, it may be possible to assign constant computational rates to each of its performance phases.

The intriguing question is whether, for this process, one needs to estimate the values of the quantities that _determine_ p(a,s).

  • Information From Global Microbenchmarks

Consider computation that requires significant communication across nodes. The first thing that changes in our theory of performance is that hardware bandwidth to local memory is replaced by hardware interconnection-network bisection bandwidth. The second thing that changes is that we are now concerned with the rate of issuing _global_ memory requests (again, this is the application bandwidth).

In the global case, the hardware underpinnings of application bandwidth can be more complex. For example, the Earth Simulator uses strongly parallel vector processors to issue memory requests (vector loads) within a node. The resulting memory-reference concurrency does not extend to communication across nodes, since that is performed by MPI. Strongly parallel processors are not _effectively_ strongly parallel when they communicate using MPI; however, there is a latency-tolerance effect when the messages are large. In both the scalar and the vector cases, machines that are clusters of SMP nodes and use MPI for global communication are likely to have less parallelism for global communication than they do for local communication whenever message size is stressed.

The third supplementary execution metric adopted by HPCC is global unit-stride data-transfer rate, as measured by transposing a square matrix that occupies at least half the total memory of the system. Specifically, the HPCC Ptrans microbenchmark implements parallel matrix transpose as A equals A plus the transpose of B, where matrices A and B are distributed across the processors. Like Stream triad, this microbenchmark has high spatial but low temporal locality. The communication pattern has pairs of processors exchanging large data blocks across the network simultaneously, so this measures the system bisection bandwidth. Actually, it measures operand bisection bandwidth, which is the minimum of hardware interconnection-network bisection bandwidth and application bandwidth. As before, the latter is equal to memory concurrency divided by memory latency.
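A serial sketch of the transpose-update in Python (the HPCC Ptrans benchmark distributes both matrices across the processors and is measured at scale; this shows only the arithmetic):

```python
def ptrans(A, B):
    """A <- A + B^T for square matrices stored as lists of rows (serial sketch
    of the parallel matrix-transpose update)."""
    n = len(A)
    for i in range(n):
        for j in range(n):
            A[i][j] += B[j][i]
    return A

print(ptrans([[1, 2], [3, 4]], [[10, 20], [30, 40]]))  # [[11, 32], [23, 44]]
```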

For weakly parallel processors and/or processors that communicate globally using MPI, this benchmark measures how well large MPI messages tolerate latency. Since the only long-range latency tolerance comes from large messages, we can estimate the concurrency, say, in words, as _one_ outstanding MPI message times the number of words per message. Unit stride makes this benchmark MPI (and cache) friendly; in particular, large MPI messages are allowed by the benchmark. As a result, for large data transfers, the application (and hence operand) bandwidth will be essentially equal to the hardware interconnection-network bisection bandwidth. (When both processor overhead and network-transport latency are negligible, the data-transfer rate reduces to the hardware bisection bandwidth).

For strongly parallel processors that do not communicate globally using MPI, hardware parallelism mechanisms generate sufficient concurrency for long-range latency tolerance, so again the application (and hence operand) bandwidth will be essentially equal to the hardware interconnection-network bisection bandwidth, which matches processors that communicate globally using MPI.

The fourth supplementary execution metric adopted by HPCC is global random data-transfer rate, as measured by Gups run on a single memory array that consumes at least half the total memory of the system. The Gups loop body makes random updates to randomly chosen memory locations. This benchmark has no spatial and no temporal locality.

For weakly parallel processors and/or processors that communicate globally using MPI, this benchmark measures the operand bandwidth when MPI is severely stressed. As before, the only long-range latency tolerance comes from large messages, but now the concurrency is _one_ outstanding MPI message times _one_ word per message. The concurrency has been divided by an enormous factor, thereby reducing the application bandwidth. This benchmark is MPI (and cache) unfriendly. Certainly, the application (and hence operand) bandwidth will be significantly less than the hardware interconnection-network bisection bandwidth.

For strongly parallel processors that do not communicate globally using MPI, hardware parallelism mechanisms generate sufficient concurrency for long-range latency tolerance, so the application (and hence operand) bandwidth will be essentially equal to the hardware interconnection-network bisection bandwidth, in sharp contrast to processors that communicate globally using MPI.

These four microbenchmarks focus primarily on memory-access patterns and their effect, by modulating concurrency, on operand bandwidth. It is not clear whether one needs a separate benchmark for synchronization behavior. If so, the execution metric global sum reduction and broadcast should be considered. In the message-passing world, this can be measured by MPI Allreduce. In the shared-memory world, the corresponding benchmark is parallel list ranking. In general, attention should be paid to collective communication operations like broadcast, barrier synchronization, and parallel prefix computations, i.e., scans.

It is even less clear what programmability metrics should be. The ultimate goal is to predict the programming rate for developing application 'a' on hardware system 's' using programming system 'p'. Much has been written about the poor programmability of systems with weakly parallel processors (there is some discussion of this in HPCwire article [107765]). One of the main ideas is that an overly tight coupling of parallelism and locality in weakly parallel systems forces programmers to assume responsibility for the detailed global management of concurrent tasks and parallel resources. Many people feel that MPI itself is _programmer unfriendly_. Are there identifiable kinds of programming work (programming phases) for which benchmarks could be developed? Can one compensate by averaging for programmer variability? These are exceedingly difficult questions, but an attempt to quantify and measure the real differences in programmability that all of us feel must be made.

  • Putting It All Together

Imagine an idealized world in which each computational phase of a scientific application is completely specified by discrete values of coordinates along three orthogonal axes in locality space. In this view, memory-access patterns in a given phase are either local or global, are spatially local or not, and are temporally local or not. Of course, if an application is local in the sense that either all its memory references are nearby or the level of data reuse is quite high, then this counts as sort of both spatial and temporal locality.

Notionally, imagine the four quadrants (spatially local or not, temporally local or not) separately for local and global communication. If the phase has both spatial and temporal locality (local or global), use Linpack R_max as its performance bound. If the phase has temporal but no spatial locality (local or global), your correspondent is uncertain which benchmark to use. Everything else not temporally local follows neatly. For local communication, depending on the presence or absence of spatial locality, use Stream triad or local Gups as the performance bound. For global communication, depending on the presence or absence of spatial locality, use Ptrans or global Gups as the performance bound. Additional analytic and experimental work remains to be done to assess the predictive accuracy of this simplified approach.
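The quadrant scheme can be written down as a small lookup (a sketch of your correspondent's mapping, with the temporal-but-not-spatial case left open, as in the text):

```python
def bounding_benchmark(scope, spatial, temporal):
    """Map a phase's position in locality space to the HPCC benchmark whose
    measured rate bounds its performance. `scope` is 'local' or 'global'."""
    if temporal and spatial:
        return "Linpack R_max"
    if temporal:
        return None                      # temporal but not spatial: open question
    if scope == "local":
        return "Stream triad" if spatial else "local Gups"
    return "Ptrans" if spatial else "global Gups"

print(bounding_benchmark("local", spatial=True, temporal=False))    # Stream triad
print(bounding_benchmark("global", spatial=False, temporal=False))  # global Gups
```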

From the systems perspective, we should recall the performance equation from earlier in this article:

                                  c(a,s)
    p(a,s) = l(a,s) * min { b(s), ------ }
                                  t(a,s)

In the present context, we probably aid intuition by _misinterpreting_ this formula: In those situations where hardware bandwidth b(s) is not the critical parameter limiting performance, take l(a,s) as the _arithmetic intensity_ (cache multiplier) due to temporal locality, and take c(a,s) as the _effective concurrency_, which, in some systems, can be stressed by irregular memory accessing. In this way, having temporal locality increases l(a,s), while not having spatial locality can decrease c(a,s). This allows you to estimate the degree of similarity of one of the computational phases of your application to a given HPCC benchmark, and to estimate both l(a,s) and the fall-off from best-case c(a,s) for this phase in your system.

The system designer is shown what is important. A computing system has some hardware bandwidth at both the local and global levels. The absence of spatial locality either does or does not stress its hardware parallelism mechanisms, at local and global levels. Its hardware locality mechanisms provide some or no performance boost in the presence of temporal locality. In practice, the successive computational phases of a scientific application may lie at various intermediate points in locality space. For this reason, a more thoroughgoing benchmark-based approach to predicting the performance of scientific applications may require estimation of the constituent factors in our performance equation.

The simplest punchline is that HPCC Linpack plus the four HPCC memory-access benchmarks give you data so that you can use the locality characteristics of the computational phases of your applications to predict when you will get Linpack-class performance from a given phase and when you will get much less. Ultimately, you will be able to predict the actual performance of full-scale applications.


The High-End Crusader, a noted expert in high-performance computing and communications, shall remain anonymous. He alone bears responsibility for these commentaries. Replies are welcome and may be sent to HPCwire editor Tim Curns at tim@hpcwire.com.

