
Features:
TOP500 AND HPCC BENCHMARKS -- WHAT THEY CAN AND CAN'T DO
Commentary from the High-End Crusader
A recent HPCwire article mused, would a new Chinese cluster computer with
2,000 AMD Opteron processors and a reported 10 TFs/s of peak performance make
it into the top 15 of the Top500 list? (The news hook here is the alleged
potential for nuclear-weapons design). After the requisite mention of the
Japanese Earth Simulator, which currently sits atop the list with 40 TFs/s of
peak performance, attention turned to other contenders. Would Cray's Red
Storm computer at Sandia make the November list? How would new clusters at
Livermore and Los Alamos do? Would the Cray leadership-class system at Oak
Ridge reach 100 and then 250 TFs/s? And would IBM's Blue Gene/L cluster at
Livermore reach 360 TFs/s?
All this is slightly curious. What exactly do such numbers mean? And why are
Linpack R_max ratings, which lie at the heart of the Top500 list, always such
sizeable fractions of system peak flops (88% for the Earth Simulator)?
There is no reason whatsoever to think that parallel architectures as diverse
as those mentioned in the first paragraph can usefully be measured by the
_single_ Linpack R_max benchmark. The typical (conventional) U.S. high-end
computer is a large-scale cluster of scalar SMP nodes. The Earth Simulator
is a cluster of vector SMP nodes. Both are shared memory within nodes and
message-passing distributed memory across nodes. In contrast, Red Storm is
a Linux multicomputer; its nodes are individual Opteron processors and local
memory. Red Storm is pure message-passing distributed memory across nodes.
Cray's X1 supercomputer, which---along with Red Storm---is an integral part
of the leadership-class system at Oak Ridge, is a distributed shared-memory
multiprocessor with vector SMP nodes. The Cray X1, and later the Cray X2,
will be components of the National Leadership Computing Facility (NLCF) at
Oak Ridge; both are shared memory within nodes and distributed shared memory
across nodes. Obviously, these widely diverse machines are appropriate for
computing widely diverse application/algorithm classes: they are good at
computing very different things!
Historically, high-end computing has painted itself into an awkward corner
by focusing inordinate attention on the relative performance of different
systems on the Linpack R_max benchmark, which is used to rank machines on the
Top500 list. Linpack is a benchmark from dense linear algebra. It has such
a high degree of producer-consumer temporal locality that it represents (is
typical of) only those applications that require significant floating-point
performance but need little in the way of memory access, communication, and
synchronization. If your critical application requires almost none of _any_
of these things, then perhaps _only_ the Linpack R_max performance number is
relevant to you.
Any specialist in performance modeling of high-end computers will tell you
that the field of high-end computing would benefit enormously from one or more
community-standard benchmarks to complement the scalable Linpack benchmark
that has been used for many years. One advantage of Linpack is that it can be
arbitrarily scaled to match the increased capabilities of newer systems. But
it emphasizes performance on regular, dense-matrix operations that possess
strong data locality, and ignores the realm of irregular, sparse-data access
typical of the majority of modern scientific-computing applications.
There is an undeniable political component to Linpack's longevity as the
primary execution metric. It can be expressed in one word: machoflops. After
providing for Iraq and homeland security, U.S. politicians are only willing to
spend money to procure "fast" computers. Unfortunately, _peak_ machoflops
(even R_max machoflops) are a very poor indicator of (actual) performance on
many important scientific applications.
Indeed, the Linpack R_max performance number, which is an _upper bound_ on
performance when there is _exceptional_ temporal locality, offers no insights
to the application developer into how the algorithmic characteristics of his
application will affect execution time, and no insights to the system designer
into which architectural characteristics of a computer system are important.
The smart money is now rallying in support of the HPC Challenge benchmark
suite, co-sponsored by DARPA's High-Productivity Computing Systems (HPCS)
program, DOE Office of Science, and NSF, and introduced by Jack Dongarra and
co-workers at SC2003. Linpack stays, but there are other benchmarks in the
suite that stress various hardware aspects of a system that govern the rate
at which memory operands can be supplied to processors.
Important user organizations have called for widespread adoption of the HPCC
benchmark suite. For example, Thomas Zacharia, ORNL's associate director for
computing and computational sciences, and point man for the NLCF, said, "The
HPC Challenge benchmark suite is an important advance that will help us better
understand which applications perform best on which HPC systems. The results
available today are already useful. We look forward to widespread adoption of
this new measurement tool by the global HPC community".
We will explore both what the HPCC benchmarks measure and how they might be
combined to predict application performance.
- Benchmarks Must Be Combined With Amdahl's Law
Is there a scientific theory of benchmarks? For example, is it possible in
principle that a suite of benchmarks could be used to accurately predict the
performance of a future system, much larger than systems currently in use, on
a scientific application much larger than any currently being run? This is a
tough question. The answer is, "probably not", unless somehow those benchmark
results could be integrated into a larger performance-modeling framework that
would allow their proper interpretation. Let us see why this is so.
One of the standard golden rules in any architect's toolkit is Amdahl's law.
A typical (made-up) problem reads: Measurements show that, when program P is
run on machine X, floating-point operations account for 60% of the run time,
without affecting the operations that account for the other 40%. If the
execution of floating-point operations is speeded up by a factor of 2, what
is the total speed-up of program P? (Answer: 1.43). The implicit assumption
here is that the floating-point operations represent a certain quantity of
work that is performed, independently of the other work, at a certain
computational rate, thus making a certain contribution to the total run time.
If that computational rate is increased, and if the two types of operations do
not overlap in time, then the contribution of floating-point operations to the
new total run time is proportionally reduced and can be calculated exactly.
Can this logically sound framework be carried over to benchmarks?
Let's try. Suppose, by some means, a customer has determined that 60% of the
run time of his critical application is accounted for by computational work
that is identifiably similar in its algorithmic characteristics to the work
in benchmark A, while the remaining 40% is accounted for by work that is
identifiably similar to the work in benchmark B, always assuming that the
times to perform these two quantities of work do not overlap. Under these
assumptions, the customer can indeed predict the relative performance of his
critical application on two different machines by running both benchmarks on
both machines and comparing the weighted sum. But how likely is it that the
customer will know such things about his applications or that benchmarks that
match specific sets of algorithmic characteristics will be available?
To estimate the execution time of a single application running on a single
system, the calculation is: quantity of computational work 1 divided by
computational rate 1 measured by benchmark 1 plus quantity of computational
work 2 divided by computational rate 2 measured by benchmark 2 and so on,
with appropriate fudging to deal with overlapping of different kinds of
computational work.
One major problem is, if a customer is to be able to decompose an arbitrary
application, saying that part of it is like one benchmark and part of it is
like some other benchmark, and so on, how many benchmarks do we need to have
available altogether? Since this question is impossible to answer in the
abstract, we need to shift gears and start from a theory of performance.
After this is straightened out and we know roughly how to predict application
performance from benchmark results, we will say a few words about programming
time, which is the other major component of time to solution.
Conventional high-end scientific computers are constructed by connecting a set
of _nodes_, each containing multiple processors and memory, using a system
_interconnection network_. If some computation can be localized to take place
largely within individual nodes, then application performance is largely
determined by the architecture of individual nodes. In contrast, if the
computation requires significant global communication across nodes, then
application performance is determined by both the architecture of individual
nodes and the architecture of the interconnection network.
Consider computation within a node. Assume temporarily that all memory
references must be satisfied from local memory: there is no help from the
cache. We restate Little's law as follows: 1) arithmetic performance of one
flop per cycle requires delivery to the processor of one memory-operand word
per cycle (this is the operand bandwidth); 2) the operand bandwidth is the
_minimum_ of the hardware bandwidth to local memory and the rate of issuing
memory requests (this is the application bandwidth); and 3) the application
bandwidth is equal to the number of outstanding memory-references, i.e., the
memory-reference concurrency, divided by the memory latency.
With a strongly parallel processor, one can almost always succeed in raising
the application bandwidth _above_ the hardware bandwidth, making hardware
bandwidth the critical parameter limiting performance. The same cannot be
said for (weakly parallel) commodity processors, which are limited in the
number of outstanding memory references they can support. A typical commodity
processor supports only 8 outstanding memory accesses. Given typical values
of the other parameters in a COTS-based system, application bandwidth becomes
the critical parameter limiting performance. (There is a long discussion of
strongly and weakly parallel systems in HPCwire article [107765]).
Here are some numbers loosely modeled on the Intel Pentium 4 XeonMP processor.
The hardware bandwidth to local memory is 0.25 words/cycle. The latency to
local memory is 200 cycles. Assuming at most 8 outstanding memory references,
the maximum application bandwidth, for _any_ application with single-word
random memory accessing, is 0.04 words/cycle, which is 6.25 times _less_ than
the hardware bandwidth.
Note that spatial data locality is not sufficient to hide latency---you need
either data-reuse or producer-consumer temporal locality. For example, you
can make the Stream sum-kernel loop A(i) = B(i) + C(i) completely spatially
local, but it executes at only a fraction of peak performance because there is
a cache miss every L iterations, where L is the length of the cache line, say,
16. Prefetching can cover L = 16 cycles of the latency, but 16 is much less
than 200. With neither data-reuse nor producer-consumer temporal locality, a
commodity processor idles most of the time. (Your correspondent may be
somewhat generous in suggesting that commodity processors exploit all forms of
producer-consumer temporal locality as effectively as possible).
So far, all limits on performance have been derived from parameters
characterizing the individual-node architecture without considering particular
applications. Let us now consider the effects of the D-cache. When data
reuse is sufficiently high that most memory references can be satisfied from
the cache, we get a cache multiplier effect that makes the _effective_ operand
bandwidth more than the _actual_ operand bandwidth. For example, if the cache
hit rate is 99%, then the arithmetic performance is 100 times the actual
operand bandwidth, which is still the minimum of the hardware bandwidth and
the application bandwidth (which, for our commodity processor, cannot exceed
0.04 words/cycle, for any application with random accessing).
- Information From Local Microbenchmarks
Choosing a set of execution metrics, and the microbenchmarks to measure them,
is making a decision about what are the important performance aspects of our
codes. Obviously, applications vary in their patterns of memory access,
communication, and synchronization. But which attributes should we measure?
We will see that (most) benchmarks do not measure application-independent
properties of architectures, although we can read about such things in
hardware reference manuals. Rather, benchmark results mix together properties
of the application, i.e., the benchmark, and properties of the architecture.
We need a broad range of benchmarks to indirectly measure all the significant
hardware mechanisms in an architecture that can lead to high performance.
The first supplementary execution metric adopted by HPCC is local unit-stride
data-transfer rate, as measured by the Stream triad benchmark using all the
processors and at least half the total memory of the system. The Stream
triad-kernel loop body is A(i) = B(i) + scalar * C(i). This microbenchmark
measures actual operand bandwidth with the Stream triad code. There is no
temporal locality but plenty of spatial locality.
For weakly parallel processors, this benchmark measures how well the node's
cache system works. Since the only latency tolerance comes from caches used
spatially, we can estimate the concurrency, say, in words, as the number of
outstanding cache misses times the number of words per miss. Unit stride
makes this benchmark cache friendly: the processor can profit from all the
outstanding words. Nevertheless, for weakly parallel processors, in all
probability, the application (and hence operand) bandwidth will be somewhat
to significantly less than the hardware bandwidth to local memory.
For strongly parallel processors, hardware parallelism mechanisms generate
plenty of memory-reference concurrency independently of the cache, so in all
probability the operand bandwidth will be essentially equal to the hardware
bandwidth to local memory, in contrast to weakly parallel processors. In
both cases, this cache-friendly benchmark measures the minimum of hardware
bandwidth and application bandwidth for this specific form of local
unit-stride memory accessing.
If we want an empirical determination of whether the hardware bandwidth is
larger or smaller than the application bandwidth while executing the Stream
triad benchmark on a given machine, then we need to run a supplementary
benchmark (in some form). By itself, the Stream triad benchmark doesn't tell
us the answer to this question. There are contexts in which this question is
important. The simplest determination is to vary the memory-reference
concurrency and observe its effect on performance: if you _can_ modify the
performance, processor parallelism is the bottleneck.
Unit-stride memory accessing is cache friendly: prefetched words are in fact
used by the processor. In contrast, random single-word memory references,
such as gathers and scatters in sparse-matrix calculations, make the bandwidth
problem (not enough bandwidth!) worse due to the granularity of cache lines.
Since random memory accessing has no spatial and no temporal locality, it
causes the effective operand bandwidth to be less than the actual operand
bandwidth by a factor equal to the cache-line length L, again, say, 16.
The second supplementary execution metric adopted by HPCC is local random
data-transfer rate, as measured by Gups run concurrently on as many
equal-sized tables as the system has processors, with the memory arrays in the
aggregate consuming at least half the total memory of the system. The Gups
loop body makes random updates to randomly chosen memory locations. This
benchmark has no spatial and no temporal locality.
For weakly parallel processors, this benchmark measures the operand bandwidth
when the cache is severely stressed. As before, the only latency tolerance
comes from caches used spatially, but now the concurrency is the number of
outstanding cache misses times the _one_ word that is useful to the processor.
If you will, the _effective_ concurrency has been divided by a factor of L,
thereby reducing the _effective_ application bandwith by the same factor.
This benchmark is cache unfriendly. Certainly, the application bandwidth will
be significantly less than the hardware bandwidth to local memory.
For strongly parallel processors, again, hardware parallelism mechanisms
generate plenty of memory-reference concurrency independently of the cache,
so in all probability the operand bandwidth will be essentially equal to the
hardware bandwidth to local memory, in sharp contrast to weakly parallel
processors. The hardware for unit-stride accessing is slightly more efficient
than the hardware for irregular-stride accessing, so we expect a small
fall-off from the unit-stride case. Nevertheless, the fall-off from unit
stride to irregular stide will be far less for a strongly parallel processor
than for a weakly parallel processor. In both cases, this cache-unfriendly
benchmark measures the minimum of hardware bandwidth and application bandwidth
for this specific form of local irregular-stride memory accessing.
It may be useful to capture our theory of performance in a simple equation.
Your correspondent likes:
c(a,s)
p(a,s) = l(a,s) * min { b(s), ------ }
t(a,s)
That is, the arithmetic performance p(a,s) of application 'a' on system 's' is
given by the product of a locality factor (cache multiplier) l(a, s), which
may be either greater than or smaller than 1, and the minimum of the hardware
bandwidth b(s) and the application bandwidth (no symbol), which is equal to
the memory-reference concurrency c(a,s) divided by the memory latency t(a,s).
Most of these quantities are application-, system-, and hence time-dependent,
since an application can vary its algorithmic characteristics as it proceeds.
(Obviously, average values over given temporal intervals exist).
Our goal is to accurately predict the performance of a full-scale application
program using a compact formula based on easily measurable data. This may be
possible if we model applications as going through (possibly overlapping)
performance phases. A _performance phase_ is a computational interval during
which some quantity of the _same kind_ of computational work is performed at
a fixed computational rate. Both of the local microbenchmarks mentioned have
this computational uniformity and measure fixed computational rates (constant
values of p(a, s)). If the full-scale application is appropriately modeled,
it may be possible to assign constant computational rates to each of its
performance phases.
The intriguing question is whether, for this process, one needs to estimate
the values of the quantities that _determine_ p(a,s).
- Information From Global Microbenchmarks
Consider computation that requires significant communication across nodes.
The first thing that changes in our theory of performance is that hardware
bandwidth to local memory is replaced by hardware interconnection-network
bisection bandwidth. The second thing that changes is that we are now
concerned with the rate of issuing _global_ memory requests (again, this is
the application bandwidth).
In the global case, the hardware underpinnings of application bandwidth can be
more complex. For example, the Earth Simulator uses strongly parallel vector
processors to issue memory requests (vector loads) within a node. The
resulting memory-reference concurrency does not extend to communication across
nodes, since that is performed by MPI. Strongly parallel processors are not
_effectively_ strongly parallel when they communicate using MPI; however,
there is a latency-tolerance effect when the messages are large. In both the
scalar and the vector cases, machines that are clusters of SMP nodes and use
MPI for global communication are likely to have less parallelism for global
communication than they do for local communication whenever message size is
stressed.
The third supplementary execution metric adopted by HPCC is global unit-stride
data-transfer rate, as measured by transposing a square matrix that occupies
at least half the total memory of the system. Specifically, the HPCC Ptrans
microbenchmark implements parallel matrix transpose as A equals A plus the
transpose of B, where matrices A and B are distributed across the processors.
Like Stream triad, this microbenchmark has high spatial but low temporal
locality. The communication pattern has pairs of processors exchanging large
data blocks across the network simultaneously, so this measures the system
bisection bandwidth. Actually, it measures operand bisection bandwidth, which
is the minimum of hardware interconnection-network bisection bandwidth and
application bandwidth. As before, the latter is equal to memory concurrency
divided by memory latency.
For weakly parallel processors and/or processors that communicate globally
using MPI, this benchmark measures how well large MPI messages tolerate
latency. Since the only long-range latency tolerance comes from large
messages, we can estimate the concurrency, say, in words, as _one_ outstanding
MPI message times the number of words per message. Unit stride makes this
benchmark MPI (and cache) friendly; in particular, large MPI messages are
allowed by the benchmark. As a result, for large data transfers, the
application (and hence operand) bandwidth will be essentially equal to the
hardware interconnection-network bisection bandwidth. (When both processor
overhead and network-transport latency are negligible, the data-transfer rate
reduces to the hardware bisection bandwidth).
For strongly parallel processors that do not communicate globally using MPI,
hardware parallelism mechanisms generate sufficient concurrency for long-range
latency tolerance, so again the application (and hence operand) bandwidth will
be essentially equal to the hardware interconnection-network bisection
bandwidth, which matches processors that communicate globally using MPI.
The fourth supplementary execution metric adopted by HPCC is global random
data-transfer rate, as measured by Gups run on a single memory array that
consumes at least half the total memory of the system. The Gups loop body
makes random updates to randomly chosen memory locations. This benchmark has
no spatial and no temporal locality.
For weakly parallel processors and/or processors that communicate globally
using MPI, this benchmark measures the operand bandwidth when MPI is severely
stressed. As before, the only long-range latency tolerance comes from large
messages, but now the concurrency is _one_ outstanding MPI message times _one_
word per message. The concurrency has been divided by an enormous factor,
thereby reducing the application bandwidth. This benchmark is MPI (and cache)
unfriendly. Certainly, the application (and hence operand) bandwidth will be
significantly less than the hardware interconnection-network bisection
bandwidth.
For strongly parallel processors that do not communicate globally using MPI,
hardware parallelism mechanisms generate sufficient concurrency for long-range
latency tolerance, so the application (and hence operand) bandwidth will be
essentially equal to the hardware interconnection-network bisection bandwidth,
in sharp contrast to processors that communicate globally using MPI.
These four microbenchmarks focus primarily on memory-access patterns and their
effect, by modulating concurrency, on operand bandwidth. It is not clear
whether one needs a separate benchmark for synchronization behavior. If so,
the execution metric global sum reduction and broadcast should be considered.
In the message-passing world, this can be measured by MPI Allreduce. In the
shared-memory world, the corresponding benchmark is parallel list ranking.
In general, attention should be paid to collective communication operations
like broadcast, barrier synchronization, and parallel prefix computations,
i.e., scans.
It is even less clear what programmability metrics should be. The ultimate
goal is to predict the programming rate for developing application 'a' on
hardware system 's' using programming system 'p'. Much has been written about
the poor programmability of systems with weakly parallel processors (there is
some discussion of this in HPCwire article [107765]). One of the main ideas
is that an overly tight coupling of parallelism and locality in weakly
parallel systems forces programmers to assume responsibility for the detailed
global management of concurrent tasks and parallel resources. Many people
feel that MPI itself is _programmer unfriendly_. Are there identifiable kinds
of programming work (programming phases) for which benchmarks could be
developed? Can one compensate by averaging for programmer variability? These
are exceedingly difficult questions, but an attempt to quantify and measure
the real differences in programmability that all of us feel must be made.
Imagine an idealized world in which each computational phase of a scientific
application is completely specified by discrete values of coordinates along
three orthogonal axes in locality space. In this view, memory-access patterns
in a given phase are either local or global, are spatially local or not, and
are temporally local or not. Of course, if an application is local in the
sense that either all its memory references are nearby or the level of data
reuse is quite high, then this counts as sort of both spatial and temporal
locality.
Notionally, imagine the four quadrants (spatially local or not, temporally
local or not) separately for local and global communication. If the phase
has both spatial and temporal locality (local or global), use Linpack R_max
as its performance bound. If the phase has temporal but no spatial locality
(local or global), your correspondent is uncertain which benchmark to use.
Everything else not temporally local follows neatly. For local communication,
depending on the presence or absence of spatial locality, use Stream triad or
local Gups as the performance bound. For global communication, depending on
the presence or absence of spatial locality, use Ptrans or global Gups as the
performance bound. Additional analytic and experimental work remains to be
done to assess the predictive accuracy of this simplified approach.
From the systems perspective, we should recall the performance equation from
earlier in this article:
c(a,s)
p(a,s) = l(a,s) * min { b(s), ------ }
t(a,s)
In the present context, we probably aid intuition by _misinterpreting_ this
formula: In those situations where hardware bandwidth b(s) is not the critical
parameter limiting performance, take l(a,s) as the _arithmetic intensity_
(cache multiplier) due to temporal locality, and take c(a,s) as the _effective
concurrency_, which, in some systems, can be stressed by irregular memory
accessing. In this way, having temporal locality increases l(a,s), while not
having spatial locality can decrease c(a,s). This allows you to estimate the
degree of similarity of one of the computational phases of your application
to a given HPCC benchmark, and to estimate both l(a,s) and the fall-off from
best-case c(a,s) for this phase in your system.
The system designer is shown what is important. A computing system has some
hardware bandwidth at both the local and global levels. The absence of
spatial locality either does or does not stress its hardware parallelism
mechanisms, at local and global levels. Its hardware locality mechanisms
provide some or no performance boost in the presence of temporal locality.
In practice, the successive computational phases of a scientific application
may lie at various intermediate points in locality space. For this reason,
a more thoroughgoing benchmark-based approach to predicting the performance
of scientific applications may require estimation of the constituent factors
in our performance equation.
The simplest punchline is that HPCC Linpack plus the four HPCC memory-access
benchmarks give you data so that you can use the locality characteristics of
the computational phases of your applications to predict when you will get
Linpack-class performance from a given phase and when you will get much less.
Ultimately, you will be able to predict the actual performance of full-scale
applications.
The High-End Crusader, a noted expert in high-performance computing and
communications, shall remain anonymous. He alone bears responsibility for
these commentaries. Replies are welcome and may be sent to HPCwire editor
Tim Curns at tim@hpcwire.com.
|