GETTING THE MOST FROM YOUR CLUSTER: TIPS AND HINTS
by Tim Curns, Editor
Maximizing cluster productivity is a high priority for many in the HPC field.
Tom Quinn, Director of Government Business Development at Linux Networx,
recently gave HPCwire tips on determining the proper cluster architecture,
weighing in-house and vendor expertise, integrating compatible tools, choosing
system management and administration tools and programming tools, planning
system installation, and training. Mastery of these areas will help maximize a
cluster's productivity over its lifetime.
HPCwire: When many people think of productivity and High Performance Computing
(HPC), they tend to think of performance. They may ask questions like: How fast
is the system? Where does it rank among the world's fastest computers? What's
wrong with this thinking?
Tom Quinn: While performance is a significant factor in productivity, it is
certainly not the only contributor. The key point to focus on is not floating
point operations per second (flops), but rather how productive the machine
will be over its lifetime. The organization that is using the computing system
for business critical analyses, or key research, must have a machine that can
achieve maximum sustained performance over its life to provide the highest
return on investment (ROI) possible.
HPCwire: Is this a universally held belief? If so, why are there still lists of
the fastest systems in the world?
TQ: It's always interesting to see which supercomputer can claim bragging
rights as the fastest system in the world; however, there is growing
discussion in the HPC community that a better standard for ranking
supercomputers is needed. Most users are more concerned about productivity,
the number of jobs run over the life of a system, than about a machine's
ability to run a certain benchmark.
HPCwire: So what are some factors that need to be considered to help optimize
a system's productivity?
TQ: One of the most significant impacts on productivity is the system's
architecture and the application or set of applications for which the system
is being purchased. A detailed analysis of the intended applications'
requirements is paramount in order to determine and implement a system that
will provide the best overall productivity. Every aspect of the application(s)
should be evaluated including its memory footprint and bandwidth requirements,
interprocess communications with respect to both latency and bandwidth,
floating point/integer needs, code structure, programming language, current
state of parallelism, as well as its data requirements (access, amount, file
systems, global vs. local, and consistency). It is critical to perform this
analysis for all applications intended to be run on the system. Typically,
this analysis will reveal bottlenecks or constraining requirements that help
determine the best system design trade-offs.
HPCwire: What kinds of challenges face those who attempt to perform these
analyses?
TQ: A common and reasonable approach to evaluate system productivity is to run
the actual application(s) the system is being procured for on each of the
proposed architectures; however, this isn't feasible in many cases. To
overcome this challenge, some organizations require each vendor to run, on
each proposed configuration, a significant number of benchmarks that attempt
to represent the application profile. This approach also has its drawbacks,
and many times yields a false sense of security. A combination of
running both application-based benchmarks that accurately represent your
application(s) profile and having a detailed system level design evaluation
based on application requirements is most effective.
HPCwire: What about a system's supporting software and tools? How do these
factors affect productivity?
TQ: Many powerful tools are available for use in HPC and Linux clustering that
are specifically designed to provide insight into the run time performance and
behavior of an application. Having the right tool may provide you with
information for optimizing and restructuring your algorithm or code that
would otherwise have gone undiscovered. The price/productivity return on your
computing investment could be substantial. For example, you purchase a tool
that costs $1,000 (or adopt a free tool that, for this example, takes two
extra days to learn) that uncovers an inefficient loop in your code. You
then unroll the loop yielding a 15% increase in performance. For most systems,
even small ones, that is an excellent price/productivity trade.
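The trade Quinn describes is easy to put in numbers. A minimal sketch: the $1,000 tool and 15% speedup are from his example, while the $250,000 system cost is an assumed figure for illustration, not anything stated in the interview.

```python
# Back-of-the-envelope sketch of the price/productivity trade above.
# The $1,000 tool and 15% speedup come from the example; the $250,000
# system cost is an assumed figure for illustration only.

def throughput_value_gained(system_cost, speedup):
    """Value of the extra jobs completed over the system's life,
    treating lifetime throughput as proportional to sustained
    performance."""
    return system_cost * speedup

tool_cost = 1_000
system_cost = 250_000      # assumption: modest departmental cluster
speedup = 0.15             # from unrolling the inefficient loop

gain = throughput_value_gained(system_cost, speedup)
print(f"Effective value of the speedup: ${gain:,.0f}")
print(f"Return relative to tool cost: {gain / tool_cost:.1f}x")
```

Even under far more conservative assumptions about system cost, the speedup repays the tool many times over, which is the point of the example.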
HPCwire: What role do management tools play in improving a system's
productivity?
TQ: Having the right management tools is a vital part of a productive system,
as they lower the total cost of system ownership and maximize the return on
your investment through simplified system administration. There are several
different free tools designed for Linux clustering at various stages of
development. Some vendors have their own cluster management tools that are
integrated into production systems. Linux Networx has designed Clusterworx and
Icebox, a comprehensive suite of management tools, to simplify cluster
administration, improve cluster utilization and efficiency, and to save
organizations time and money.
HPCwire: What should organizations consider when researching tools for a
system?
TQ: It is crucial to consider both in-house and vendor expertise when
considering operating systems, supporting software and tools, management
software, and other components of the system's software architecture. For
example, don't buy tools you aren't going to be able to use or have a set of
tools installed that neither you nor the vendor know. This is particularly
important when you embrace open source tools and software. Developers need to
consider if the software is backed by a company or has a stable maintainer and
development base, reliable support, and whether there is enough internal
expertise to maintain the source code should the original group change
direction. These are all serious points to consider to ensure long-term
productivity.
HPCwire: Since the lifetime of an HPC system is limited, what are some actions
organizations can take to squeeze the most productivity from a system during
its lifetime?
TQ: Maximizing productivity means having your system up and running as much as
possible. One factor that can adversely affect this is the actual system
installation and configuration. Every day a system sits unavailable after you
have paid for it is incredibly expensive. If, for
example, a system has a projected lifespan of three years and, after it is
received, it takes three months (which is not unheard of) to get it installed,
configured, and running your production codes, those three months equate to an
8% reduction in the system's useful productivity. Another way to look at it is
through Moore's law, which holds that systems double in speed roughly every 18
months; by that measure, a three-month delay likewise translates into an
effective loss of performance in the same range. Delays are costly.
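The delay arithmetic can be sketched directly. The 36-month lifespan and 3-month delay are Quinn's figures; the $2 million system price is an assumed number, used only to translate the lost fraction into dollars:

```python
# Cost of an installation delay, assuming productivity accrues evenly
# over the system's useful life. The $2M system price is illustrative
# only; lifespan and delay are the figures from the interview.

def delay_cost(system_price, lifespan_months, delay_months):
    """Return the fraction of useful life lost and its dollar value."""
    lost_fraction = delay_months / lifespan_months
    return lost_fraction, system_price * lost_fraction

fraction, dollars = delay_cost(2_000_000, 36, 3)
print(f"Useful life lost: {fraction:.1%}")    # ~8%, as in the text
print(f"Effective cost:   ${dollars:,.0f}")
```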
HPCwire: So what are some things that companies can do to prevent such delays?
TQ: Planning ahead to have a seamless fit and integration with existing IT
infrastructure such as networks, storage systems, system management functions,
and facilities will speed getting the system into production and reduce the
risk to ongoing operations. A serious but common mistake is to not
appropriately account for a facility's requirements as part of a procurement.
Systems have been known to be received at a site only to find out that there
was inadequate power or cooling to run them, resulting in significant delays
and expense. An important vendor consideration is to ascertain what their
post-sale problem resolution resources and processes are going to be. No one
and no systems are perfect, so it is better to be ready to deal with problems
ahead of time. It is also important to recognize the value of having systems
pre-staged, tested, and pre-configured before they show up. An advantage to
this is you can run codes and configure the system to your needs prior to
having it arrive on your floor, thus reducing the time to production.
Linux Networx has many processes in place to ensure a quick, efficient
delivery process. For example, Linux Networx built up an entire 2,816-
processor cluster system prior to delivering it to Los Alamos National
Laboratory. By building up the system beforehand and working out any issues
prior to delivery, Lab scientists were running benchmarks on the cluster just
fifteen days after the system was delivered.
HPCwire: How can a system vendor help maximize productivity?
TQ: HPC vendors need to be engaged over the lifetime of the system (not just
install it and run). To start, they should be actively involved with pre-
installation as part of the design process. They need to be experienced and
know what to do to (or not to do) in order to minimize the impact on ongoing
operations and get the system into production as soon as possible. The vendor
needs to examine and help plan facility requirements, provide sound project
planning and risk management, and be very actively involved post-installation.
This is an area frequently overlooked or not considered prior to procurement.
The vendor's involvement and support, ability to provide training and
education to the sys-admin team, availability of expertise and resources to
help port and optimize codes, and their ability to tap third party vendor
resources, are all necessary for the greatest possible productivity. Linux
Networx has developed an entire professional services program dedicated to
providing these services to our customers and ensuring a productive system.
HPCwire: Many organizations spend time and money on training courses in the
hopes that this will help their admins boost system performance. In your
experience, is training a worthwhile investment for improving productivity?
TQ: Whenever a new system is purchased, in particular if there is minimal
institutional knowledge or experience with the types of systems being
considered, education is critical both before and after the system is
procured. Education is important for optimizing the procurement (making sure
you get the best solution for your money) and ensuring sustained productivity
after the system is installed. In terms of impact on price/productivity,
education for system administrators on the system's management tools is
difficult to prove quantitatively, but it can reduce downtime and help get
users running faster and more efficiently. For instance, a
developer could go to a three-day class, go back and look at their code, and
squeeze out an additional 20% performance. If the system cost $1 million and
the class cost $3,000, that's an excellent price/productivity trade.
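The same sketch works for the training trade; all three figures here ($1 million system, $3,000 class, 20% speedup) come straight from the example above:

```python
# The training price/productivity trade, in numbers. Treats lifetime
# job throughput as proportional to sustained performance, so a 20%
# speedup is worth roughly 20% of the system's cost in extra work.

system_cost = 1_000_000
class_cost = 3_000
speedup = 0.20

value_of_speedup = system_cost * speedup
print(f"Value of extra throughput: ${value_of_speedup:,.0f}")
print(f"Return relative to class cost: "
      f"{value_of_speedup / class_cost:.1f}x")
```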
HPCwire: What does Linux Networx do differently to ensure customers get the
most productive systems possible?
TQ: To answer your question simply, Linux Networx provides all of the
technologies and services we've been discussing in this interview. For
example, to help customers increase the reliability and longevity of their
cluster systems, Linux Networx provides detailed simulations of airflow in
current and proposed datacenters using Airpak modeling software from Fluent.
The Airpak simulations reveal the airflow and temperature distribution in a
datacenter to better understand potential ventilation problems before a
computing system is installed. From the results of the Airpak simulations,
Linux Networx works with the customer to design optimal sizing and placement
of racks and cooling equipment to maximize the utilization of the heating,
ventilation, and air conditioning (HVAC) systems. Airpak simulations are just
one of many services Linux Networx provides, such as total cluster management,
validation and integration of the latest components, full pre-ship system
buildup and testing, followed by rapid on-site installation, plus ongoing
service and training programs to help customers maximize their cluster's ROI.
All these efforts are designed to help customers get the most production
possible from their cluster during its lifetime.
HPCwire: What's on the horizon for Linux Networx?
TQ: Linux Networx will continue delivering highly productive computing systems
to the HPC marketplace and refining our industry-leading technologies such as
our cluster management tools, clustered storage, and Evolocity cluster
systems. Later this year, Linux Networx will be unveiling exciting new
technologies that will further elevate our efforts to deliver the most
productive computing systems available. We are also expanding our training and
professional services offerings, enabling customers to maximize ROI. With the
VC funding we recently received, Linux Networx is also expanding its reach
into markets that can benefit from high-productivity computing and increasing
our presence in North and South America; Europe, the Middle East, and Africa
(EMEA); and Asia Pacific. Basically, you can expect to see a lot of great things
coming from Linux Networx this year.
HPCwire: Sum up in a sentence what potential HPC cluster customers should
keep in mind.
TQ: System productivity must be considered before, during and after
installation to achieve your organization's computing goals.
HPCwire: Thanks, Tom. We appreciate your thoughts, and I'm sure the
community will take note of your suggestions!
Prior to joining Linux Networx, Mr. Quinn was the director of business
development and operations for Scyld Computing Corporation. He also spent 10
years at the NASA Goddard Space Flight Center where he was responsible for
feasibility studies, conceptual design, implementation, fabrication,
integration and test of both flight and ground spacecraft systems. Amongst
many achievements, he was responsible for leading the successful reentry of
the Compton Gamma Ray Observatory in June 2000. Mr. Quinn earned a bachelor's
degree in electrical engineering from Pennsylvania State University and has
continued his education with graduate level studies focused on control theory
and computer architecture at the Johns Hopkins University.