The global publication of record for High Performance Computing / January 28, 2005: Vol. 14, No. 4


Cluster Computing:

by Tim Curns, Editor

Maximizing cluster productivity is a high priority for many in the HPC field. Tom Quinn, Director of Government Business Development at Linux Networx, recently gave HPCwire tips on determining proper cluster architecture, in-house and vendor expertise, integration and compatibility of tools, system management and administration tools, programming tools, system installation, and training. Mastery of these areas will help maximize a cluster's productivity.

HPCwire: When many people think of productivity and High Performance Computing (HPC), they tend to think of performance. They may ask how fast the system is, or where it ranks among the world's fastest computers. What's wrong with this thinking?

Tom Quinn: While performance is a significant factor in productivity, it is certainly not the only contributor. The key point to focus on is not floating point operations per second (flops), but rather how productive the machine will be over its lifetime. An organization that uses the computing system for business-critical analyses or key research must have a machine that can achieve maximum sustained performance over its life to provide the highest possible return on investment (ROI).

HPCwire: Is this a universally held belief? If so, why are there still lists of the fastest systems in the world?

TQ: It's always interesting to see which supercomputers can claim bragging rights as the fastest system in the world; however, there is growing discussion in the HPC community that there needs to be a better standard for ranking supercomputers. Most users are more concerned about productivity, or the number of jobs run over the life of a system, than about a machine's ability to run a certain benchmark.

HPCwire: So what are some factors that need to be considered to help optimize a system's productivity?

TQ: One of the most significant impacts on productivity is the system's architecture and the application or set of applications for which the system is being purchased. A detailed analysis of the intended applications' requirements is paramount in order to determine and implement a system that will provide the best overall productivity. Every aspect of the application(s) should be evaluated, including memory footprint and bandwidth requirements, interprocess communications with respect to both latency and bandwidth, floating point/integer needs, code structure, programming language, current state of parallelism, and data requirements (access, amount, file systems, global vs. local, and consistency). It is critical to perform this analysis for all applications intended to be run on the system. Typically, this analysis will reveal bottlenecks or constraining requirements that will help determine the best system design trade-offs.

HPCwire: What kinds of challenges face those who attempt to perform these analyses?

TQ: A common and reasonable approach to evaluating system productivity is to run the actual application(s) the system is being procured for on each of the proposed architectures; however, this isn't feasible in many cases. To overcome this challenge, some organizations will require each vendor to run, for each proposed configuration, a significant number of benchmarks that attempt to represent the profile of the application(s). This approach also has its drawbacks, and often yields a false sense of security. The most effective approach combines application-based benchmarks that accurately represent your application profile with a detailed system-level design evaluation based on application requirements.

HPCwire: What about a system's supporting software and tools? How do these factors affect productivity?

TQ: Many powerful tools are available for use in HPC and Linux clustering that are specifically designed to provide insight into the run-time performance and behavior of an application. Having the right tool may provide you with information for optimizing and restructuring your algorithm or code that would otherwise have gone undiscovered. The price/productivity return on your computing investment could be substantial. For example, suppose you purchase a tool that costs $1,000 (or embrace a free tool that, for this example, takes two person-days longer to learn) and it uncovers an inefficient loop in your code. You then unroll the loop, yielding a 15% increase in performance. For most systems, even small ones, that is an excellent price/productivity trade.

HPCwire: What role do management tools play in improving a system's productivity?

TQ: Having the right management tools is a vital part of a productive system, as they lower the total cost of system ownership and maximize the return on your investment through simplified system administration. There are several different free tools designed for Linux clustering at various stages of development. Some vendors have their own cluster management tools that are integrated into production systems. Linux Networx has designed Clusterworx and Icebox, a comprehensive suite of management tools, to simplify cluster administration, improve cluster utilization and efficiency, and save organizations time and money.

HPCwire: What should organizations consider when researching tools for a particular system?

TQ: It is crucial to consider both in-house and vendor expertise when evaluating operating systems, supporting software and tools, management software, and other components of the system's software architecture. For example, don't buy tools you aren't going to be able to use, or have a set of tools installed that neither you nor the vendor knows. This is particularly important when you embrace open source tools and software. Developers need to consider whether the software is backed by a company or has a stable maintainer and development base, whether it has reliable support, and whether there is enough internal expertise to maintain the source code should the original group change direction. These are all very serious points to consider to ensure long-term productivity.

HPCwire: Since the lifetime of an HPC system is limited, what are some actions organizations can take to squeeze the most productivity from a system during its lifetime?

TQ: Maximizing productivity means having your system up and running as much as possible. One factor that can adversely affect this is the actual system installation and configuration. Any time a system is unavailable for use after you have paid for it is incredibly expensive. If, for example, a system has a projected lifespan of three years and, after it is received, it takes three months to have it installed (which is not unheard of), configured and running your production codes, those three months equate to an 8% reduction in the useful productivity of the system. Another way to look at it is based on Moore's law. Moore's law indicates that systems will double in speed every 18 months. A three-month delay essentially equates to a loss of system performance of 8%. Delays are costly.
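The arithmetic in this answer can be sketched in a few lines of Python. The function names are illustrative and not from the interview. The first function reproduces the lifespan figure (3 of 36 months, about 8.3%). For the Moore's law figure the interview does not say how the 8% was derived, so two readings are shown as assumptions: a straight-line one (half the relative value lost over 18 months, prorated to 3 months), which also gives about 8.3%, and a compounding one, which gives a somewhat larger shortfall of roughly 11%.

```python
def fraction_of_life_lost(delay_months: float, lifespan_months: float) -> float:
    """Share of the projected lifespan consumed by installation delay."""
    return delay_months / lifespan_months


def linear_shortfall(delay_months: float, doubling_months: float = 18.0) -> float:
    """Straight-line reading (an assumption): a system loses half its
    relative value over one doubling period, prorated by the delay."""
    return 0.5 * delay_months / doubling_months


def compounding_shortfall(delay_months: float, doubling_months: float = 18.0) -> float:
    """Compounding reading (an assumption): speed relative to hardware
    available `delay_months` later, if speed doubles every period."""
    return 1.0 - 2.0 ** (-delay_months / doubling_months)


if __name__ == "__main__":
    # Figures from the interview: 3-month delay, 3-year life, 18-month doubling.
    print(f"Lifespan lost:         {fraction_of_life_lost(3, 36):.1%}")
    print(f"Linear shortfall:      {linear_shortfall(3):.1%}")
    print(f"Compounding shortfall: {compounding_shortfall(3):.1%}")
```

Either way the conclusion is the same: a delay of a few months costs on the order of a tenth of what the system can deliver.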

HPCwire: So what are some things that companies can do to prevent such delays?

TQ: Planning ahead for a seamless fit and integration with existing IT infrastructure such as networks, storage systems, system management functions, and facilities will speed getting the system into production and reduce the risk to ongoing operations. A serious but common mistake is to not appropriately account for facility requirements as part of a procurement. Systems have been known to arrive at a site only for the owners to find that there was inadequate power or cooling to run them, resulting in significant delays and expense. An important consideration when choosing a vendor is to ascertain what its post-sale problem resolution resources and processes are going to be. No person and no system is perfect, so it is better to be ready to deal with problems ahead of time. It is also important to recognize the value of having systems pre-staged, tested, and pre-configured before they show up. An advantage of this is that you can run codes and configure the system to your needs before it arrives on your floor, thus reducing the time to production.

Linux Networx has many processes in place to ensure a quick, efficient delivery process. For example, Linux Networx built up an entire 2,816-processor cluster system prior to delivering it to Los Alamos National Laboratory. Because the system was built up beforehand and any issues were worked out prior to delivery, Lab scientists were running benchmarks on the cluster just fifteen days after the system was delivered.

HPCwire: How can a system vendor help maximize productivity?

TQ: HPC vendors need to be engaged over the lifetime of the system (not just install it and run). To start, they should be actively involved with pre-installation as part of the design process. They need to be experienced and know what to do (or not to do) in order to minimize the impact on ongoing operations and get the system into production as soon as possible. The vendor needs to examine and help plan facility requirements, provide sound project planning and risk management, and be very actively involved post-installation. This is an area frequently overlooked or not considered prior to procurement. The vendor's involvement and support, ability to provide training and education to the sys-admin team, availability of expertise and resources to help port and optimize codes, and ability to tap third-party vendor resources are all necessary for the greatest productivity possible. Linux Networx has developed an entire professional services program that is dedicated to providing these services to our customers to ensure a productive cluster system.

HPCwire: Many organizations spend time and money on training courses in the hopes that this will help their admins boost system performance. In your experience, is training a worthwhile investment for improving productivity?

TQ: Whenever a new system is purchased, in particular if there is minimal institutional knowledge or experience with the types of systems being considered, education is critical both before and after the system is procured. Education is important for optimizing the procurement (making sure you get the best solution for your money) and ensuring sustained productivity after the system is installed. In terms of price/productivity, the impact of educating system administrators on the system's management tools is difficult to prove quantitatively. It may reduce down time and/or enable them to get users running faster and more efficiently. The impact of developer training is easier to illustrate. For instance, a developer could go to a three-day class, go back and look at their code, and squeeze out an additional 20% performance. If the system cost $1 million and the class cost $3,000, that's an excellent price/productivity trade.
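The price/productivity trade in the last example can be made concrete with a short Python sketch. It rests on a simplifying assumption that is not in the interview: a performance gain is valued as the same percentage of the system's purchase price. The function name `payback_ratio` is hypothetical.

```python
def payback_ratio(system_cost: float, gain_percent: float, investment: float) -> float:
    """Value of the extra capacity gained per dollar invested.

    Assumes (as a simplification) that a gain of N percent in performance
    is worth N percent of what the system cost to buy.
    """
    equivalent_capacity_value = system_cost * gain_percent / 100.0
    return equivalent_capacity_value / investment


if __name__ == "__main__":
    # Figures from the interview: $1 million system, $3,000 class, 20% gain.
    ratio = payback_ratio(system_cost=1_000_000, gain_percent=20, investment=3_000)
    print(f"Each training dollar buys about ${ratio:.0f} of equivalent capacity")
```

Under that assumption the class returns the equivalent of $200,000 of computing capacity, or roughly 67 times its cost. The same function applies to the earlier $1,000 tool and its 15% gain once a system cost is supplied; the interview does not give one for that case.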

HPCwire: What does Linux Networx do differently to ensure customers get the most productive systems possible?

TQ: To answer your question simply, Linux Networx provides all of the technologies and services we've been discussing in this interview. For example, to help customers increase the reliability and longevity of their cluster systems, Linux Networx provides detailed simulations of airflow in current and proposed datacenters using Airpak modeling software from Fluent. The Airpak simulations reveal the airflow and temperature distribution in a datacenter, exposing potential ventilation problems before a computing system is installed. Based on the results of the Airpak simulations, Linux Networx works with the customer to design the optimal sizing and placement of racks and cooling equipment, maximizing the utilization of the heating, ventilation and air conditioning (HVAC) system. Airpak simulations are just one of many services Linux Networx provides. Others include total cluster management, validation and integration of the latest components, full pre-ship system buildup and testing followed by rapid on-site installation, and ongoing service and training programs to help customers maximize their cluster's ROI. All these efforts are designed to help customers get the most production possible from their cluster during its lifetime.

HPCwire: What's on the horizon for Linux Networx?

TQ: Linux Networx will continue delivering highly productive computing systems to the HPC marketplace and refining our industry-leading technologies such as our cluster management tools, clustered storage, and Evolocity cluster systems. Later this year, Linux Networx will be unveiling exciting new technologies that will further elevate our efforts to deliver the most productive computing systems available. We are also expanding our training and professional services offerings, enabling customers to maximize ROI. With the VC funding we recently received, Linux Networx is also expanding its reach into markets that can benefit from high productivity computing and increasing its presence in North and South America, Europe, the Middle East and Africa (EMEA), and Asia Pacific. Basically, you can expect to see a lot of great things coming from Linux Networx this year.

HPCwire: Sum up in a sentence what potential HPC cluster customers should always remember.

TQ: System productivity must be considered before, during and after installation to achieve your organization's computing goals.

HPCwire: Thanks, Tom. We appreciate your thoughts, and I'm sure the community will take note of your suggestions!

Prior to joining Linux Networx, Mr. Quinn was the director of business development and operations for Scyld Computing Corporation. He also spent 10 years at the NASA Goddard Space Flight Center where he was responsible for feasibility studies, conceptual design, implementation, fabrication, integration and test of both flight and ground spacecraft systems. Amongst many achievements, he was responsible for leading the successful reentry of the Compton Gamma Ray Observatory in June 2000. Mr. Quinn earned a bachelor's degree in electrical engineering from Pennsylvania State University and has continued his education with graduate level studies focused on control theory and computer architecture at the Johns Hopkins University.
