HPCwire
 The global publication of record for High Performance Computing / January 21, 2005: Vol. 14, No. 3


Cluster Computing:

10GbE SERVER NETWORKING: READY FOR PRIMETIME?
by Saqib Jang, Margalla Communications

High-performance cluster computing (HPCC) applications rely more on network performance than any other enterprise application category. Large HPC batch applications can run for hours or days, so the network must deliver consistently high performance and non-stop availability over long periods. Designing Ethernet networks for cluster computing therefore involves selecting network components, including server networking elements, that have the performance and robustness needed for successful execution of distributed applications. This article reviews server networking products that support the latest generation of Ethernet technology, 10 Gigabit Ethernet (10GbE), and how they address the evolving nature of HPCC requirements.

A number of recent developments (including Microsoft's impending entry into the HPCC space) illustrate the transition of HPCC from traditional research and scientific computing to commercial computing, and the resulting growth of the HPC application domain. HPC cluster architectures now offer enterprises a high return on investment and competitive advantage by solving compute-intensive problems that previously could not be solved in a timely or cost-effective way. Commercial applications such as data mining are successfully using HPCC architectures. For example, oil companies use these clusters to perform seismic evaluations, oil reservoir simulations, and other tasks. The financial services industry uses HPCC clusters for portfolio forecasting. Bioinformatics and special effects rendering have also seen rapid adoption of HPCC.

Enterprise IT organizations are also converging on HPCC as a uniform architecture that can support the scalability and availability requirements of both data processing and technical computing applications. Requirements such as reliability, availability, and serviceability are also becoming important for all customers. Cluster architectures provide these features at a fraction of the cost of traditional approaches.

Networking: Foundation for HPCC Cluster ROI

The first key requirement for developing a high-ROI HPCC infrastructure is the ability to easily move and process very large datasets at high I/O rates. Typically, HPC clusters use switched Gigabit Ethernet for client-to-server data movement.

Complicating this task is the fact that while disk drives double in capacity every 9 months, their performance lags behind the pace of CPU performance improvements. As a result, maintaining data flow to ever-faster server CPUs has become a critical storage networking requirement for an efficient and balanced HPC cluster computing infrastructure.

HPCC environments commonly deploy a SAN switching fabric, such as a Fibre Channel switch, or a Gigabit Ethernet-based NAS environment, which allows each cluster node to have high-performance access to a shared storage pool.

Second, HPCC cluster design should allow for a network infrastructure that provides predictable multi-gigabit bandwidth and low latency for cluster inter-process communications (IPC). Gigabit Ethernet networking supports data transfer rates of ~800 Mbps and latencies on the order of 60 microseconds, which makes it unsuitable for IPC except in low-end cluster configurations.

While Gigabit Ethernet bandwidth is limited by wirespeed capacity, most of the latency incurred with Ethernet IPC is due to software protocol processing on servers. Another drawback of network I/O processing in software is that it results in high CPU utilization during intense network activity. This can be detrimental to the performance of the range of HPC applications requiring high levels of concurrent computation and IPC.

For this reason, all but the simplest cluster configurations have traditionally required a separate, dedicated switching fabric for IPC, such as InfiniBand, Myrinet, or Quadrics, that can deliver multi-gigabit bandwidth and sub-10 microsecond latency.
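The gap between the two classes of fabric can be illustrated with a simple first-order model in which transfer time equals fixed latency plus message size divided by bandwidth. The sketch below uses the Gigabit Ethernet figures quoted above (~800 Mbps, 60 microseconds). The 2 Gbps and 10 microsecond values for the dedicated fabric are illustrative assumptions chosen to fall in the "multi-gigabit, sub-10 microsecond" class, not measurements of any particular product.

```python
def transfer_time_us(message_bytes, bandwidth_bps, latency_us):
    """First-order cost model: fixed latency plus serialization time."""
    serialization_us = message_bytes * 8 / bandwidth_bps * 1e6
    return latency_us + serialization_us


# Gigabit Ethernet figures quoted in the article.
GBE = {"bandwidth_bps": 800e6, "latency_us": 60.0}
# Hypothetical dedicated IPC fabric (assumed values, for comparison only).
FABRIC = {"bandwidth_bps": 2e9, "latency_us": 10.0}

if __name__ == "__main__":
    for size in (64, 1024, 65536):
        gbe = transfer_time_us(size, **GBE)
        fab = transfer_time_us(size, **FABRIC)
        print(f"{size:>6} B   GbE {gbe:8.1f} us   fabric {fab:8.1f} us")
```

For the small messages typical of tightly coupled IPC, the fixed latency term dominates the total, so in this model raw bandwidth alone does little to close the gap.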

Enter 10 Gigabit Ethernet: Promising Convergence of HPC Networking

The latest generation of Ethernet networking, 10 Gigabit Ethernet (10GbE), is emerging as an excellent choice for HPC networking. 10GbE switching has the potential to be a highly scalable, standards-based alternative both for data movement and as an IPC fabric.

As more and more Gigabit Ethernet-based enterprise servers and applications continue to migrate to 10GbE, the cost of 10GbE switch ports and server adapters is expected to continue to drop rapidly.

10GbE more than meets the bandwidth requirement for client-to-server data movement and as a switching fabric for cluster IPC. End-to-end latency for IPC over 10GbE can also be expected to fall to sub-10 microsecond levels with the addition of TCP/IP Offload Engines (TOE), which execute TCP/IP processing in server NIC firmware/hardware rather than in kernel software. 10GbE TOEs are also expected to greatly reduce CPU utilization, freeing the CPU for application processing and improving overall system performance.

In addition, 10GbE server NICs also support iSCSI and socket upper layer interfaces, enhancing the role that switched Ethernet can play as a single "converged" cluster fabric which meets the needs for IP communications, IPC, and NAS/SAN storage interconnect. A converged fabric allows high-performance clusters to be based on a single switching fabric in contrast to the more complex and costly approach of using proprietary fabrics for IPC and storage.

The first 10GbE host adapters were introduced in the latter half of 2004. By evolving to address the gamut of Ethernet-based data center applications, 10GbE NICs are expected to reach sufficient volumes to ride cost reduction curves similar to those previously observed for Fast Ethernet and GbE NICs.

The available 10GbE server NIC products from Chelsio, Intel, and S2io can be classified into two broad categories based on their protocol offload mechanism. 10GbE NIC products from Intel and S2io focus on a stateless, partial offload approach, with support for features such as TCP checksum computation and TCP transmit segmentation, to deliver line-rate 10Gbps bandwidth.

Chelsio, on the other hand, offers 10GbE TOE NICs that fully offload network protocol processing from servers, while maintaining state for all data transfers. The full protocol offload approach goes beyond delivering line-rate 10Gbps throughput in providing latency and CPU utilization benefits.
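The two stateless offload features named above, checksum computation and transmit segmentation, are per-packet operations that need no knowledge of connection state, which is what makes them straightforward to move into NIC hardware. The sketch below shows in software roughly what each one does. It is a simplified illustration, not any vendor's implementation: the checksum follows the standard Internet checksum algorithm (RFC 1071) but omits the TCP pseudo-header, and the 1460-byte segment size assumes a 1500-byte MTU with 20-byte IP and TCP headers and no options.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement checksum (RFC 1071), as used by TCP."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]
    # Fold any carries back into the low 16 bits.
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def segment(payload: bytes, mss: int = 1460) -> list:
    """Split one large send buffer into MSS-sized chunks."""
    return [payload[i:i + mss] for i in range(0, len(payload), mss)]


if __name__ == "__main__":
    chunks = segment(b"x" * 64000)
    print(len(chunks), "segments; first checksum:",
          hex(internet_checksum(chunks[0])))
```

Without offload, the host CPU performs both steps for every segment it sends. With stateless offload the host hands the NIC one large buffer, but it still runs the rest of the TCP state machine itself, which is the part a full-offload TOE takes over.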

10GbE Stateless Offload: Delivering High-Performance Standards-Based Networking for HPCC Data Movement

Intel announced its second-generation 10GbE server NIC product, the Intel PRO/10GbE Server Adapter supporting the PCI-X 1.0 form factor, in May 2004. Single-quantity list pricing for the PRO/10GbE SR (for connectivity up to 300 meters) and LR (for connectivity up to 10 kilometers) models is $4,770 and $7,995, respectively.

While PRO/10GbE pricing is high by Gigabit NIC standards, Intel is seeing growing use of the product in high-end HPCC deployments in research, laboratory, and academic settings. PRO/10GbE NICs are being used to accelerate the delivery of multi-gigabyte visual simulation and rendering data sets resulting from complex applications that model, analyze, and predict the performance of drugs, particle physics, and environmental ecosystems. Such applications can easily migrate from GbE to 10GbE and are relatively price-insensitive.

"While mainstream use of 10GbE is still 1-2 years out, we're seeing a nice uptick in production deployment of our 10GbE NICs," says Steve Rotz, Product Line Marketing Manager for Intel's PRO/10GbE NIC products. "Our 10GbE products are completely based on Intel technology, use industry standards and are highly reliable."

In addition to end-user deployments within high-end HPCC environments, Intel is actively marketing its 10GbE NICs to the server OEM community. "We're very gratified that IBM has chosen to OEM our PRO/10GbE NICs," says Intel's Rotz.

On the issue of reducing server network protocol processing overhead, Intel has disclosed the general outlines of its "TCP acceleration" project without discussing specific product plans. The project is designed to speed up the performance of existing server TCP/IP protocol stacks (as opposed to fully offloading network processing to a TOE NIC). Industry observers believe that Intel's TCP acceleration approach requires changes to the CPU, memory controller, and Ethernet controller chips, and are cautious about just how aggressively this approach will be deployed.

S2io Inc., a designer of ASICs and maker of network adapter cards, offers the Xframe 10GbE server NIC in short-reach and long-reach fiber versions. Xframe is priced at $4,990 for the short-reach version and $6,450 for the long-reach version.

Like the Intel PRO/10GbE, Xframe provides stateless protocol offload, and its initial target application is high-speed data movement in HPC clustering environments.

As evidence of the success of its Xframe product in the market, S2io points to recently concluded OEM deals with HP, SGI and Cray.

From an enhancement perspective, S2io plans to announce shortly a new product that doubles the bandwidth available to Xframe-enabled 10GbE links. In addition, support for the Remote Direct Memory Access over TCP/IP (RDMA/TCP) standard is planned for 2006. RDMA/TCP will conserve memory bandwidth and reduce latency by eliminating kernel interrupts for copying message data between the network buffer pool and application buffers, which benefits cluster IPC.

10GbE Full Protocol Offload: Delivering on the Promise of Converged Networking for HPC Clustering

Chelsio Communications, Inc., a developer of 10 Gigabit Ethernet ASIC-based adapter cards with protocol acceleration technology, employs a stateful, full protocol offload approach in its Terminator line of 10GbE NICs. Chelsio's PCI-X compliant T110 TOE and N110 stateless offload NICs started shipping in May 2004. Single-quantity pricing for the T110 ranges from $3,995 for the short-reach version to $5,500 for the long-reach version, while N110 pricing ranges from $2,495 for the short-reach version to $2,995 for the long-reach version.

The T110 is the first server NIC to offer a silicon-based 10G TOE and, as such, delivers demonstrable throughput, CPU utilization and latency benefits. It enables 10GbE to go beyond high-speed data movement and become a contender as a cluster IPC fabric within HPC environments.

Recently, Veritest published a benchmark report (available via the Veritest and Chelsio web sites) that shows the Chelsio T110 transmitting standard 1500-byte Ethernet frames in a peer-to-peer configuration at 7.8 Gbps throughput, with less than 10 microseconds of user-to-user process latency and 50% CPU utilization on a 2.2GHz AMD Opteron-based server.
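A rough calculation shows why these numbers point to offload. The sketch below converts the reported throughput into a frame rate and a per-frame CPU budget. It assumes the quoted throughput is 7.8 gigabits per second carried entirely in 1500-byte frames, ignores Ethernet preamble and inter-frame gap, and treats the 50% utilization as applying to a single 2.2GHz CPU. These are simplifying assumptions, not details taken from the Veritest report.

```python
def frames_per_second(throughput_bps: float, frame_bytes: int) -> float:
    """Frames needed each second to sustain the given throughput."""
    return throughput_bps / (frame_bytes * 8)


def cycles_per_frame(cpu_hz: float, utilization: float, fps: float) -> float:
    """CPU cycles spent per frame at the given utilization."""
    return cpu_hz * utilization / fps


if __name__ == "__main__":
    fps = frames_per_second(7.8e9, 1500)
    print(f"frames per second : {fps:,.0f}")
    print(f"time per frame    : {1e6 / fps:.2f} us")
    print(f"cycles per frame  : {cycles_per_frame(2.2e9, 0.5, fps):,.0f}")
```

Under these assumptions the link carries 650,000 frames per second, which leaves about 1.5 microseconds and roughly 1,700 CPU cycles per frame for all host-side work.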

"Published benchmarks show that Chelsio Communications has delivered the first 10G Ethernet adapter card that simultaneously achieves high throughput, low latency, and more importantly, low CPU utilization," says Kianoosh Naghshineh, Chelsio's president and CEO. "Our T110 TOE NIC truly makes the ubiquitous high-speed Ethernet feasible for HPCC interconnect applications."

The Chelsio T110 also implements iSCSI protocol acceleration and has been demonstrated to have performance advantages for SAN and NAS applications as well. Veritest testing has shown that the T110 can deliver 670K I/Os per second (IOPS) and over 800 MB/s throughput running in iSCSI target mode, which is significantly greater performance than that delivered by 2G or 4G Fibre Channel technology. TCP protocol processing overhead is also a significant concern for high-speed access to NAS storage, and Chelsio's T110 TOE NIC is beneficial here as well. The T110's high NAS and SAN performance vis-a-vis FC was behind a recent OEM win (which Chelsio will be announcing later in 2005).
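To put the throughput figure in context, the short sketch below compares the reported 800 MB/s with the nominal per-direction data rates commonly quoted for Fibre Channel, about 200 MB/s for 2G and 400 MB/s for 4G. Those nominal rates are general figures assumed here for comparison. They do not come from the Veritest testing, and real-world FC throughput depends on the configuration.

```python
# Nominal per-direction data rates (assumed, for comparison only).
NOMINAL_FC_MB_S = {"2G Fibre Channel": 200, "4G Fibre Channel": 400}
# iSCSI target throughput reported for the T110 in the article.
T110_ISCSI_MB_S = 800


def ratio(measured_mb_s: float, baseline_mb_s: float) -> float:
    """How many times faster the measured rate is than the baseline."""
    return measured_mb_s / baseline_mb_s


def mb_per_s_to_gbps(mb_s: float) -> float:
    """Convert megabytes per second to gigabits per second (decimal units)."""
    return mb_s * 8 / 1000


if __name__ == "__main__":
    print(f"800 MB/s = {mb_per_s_to_gbps(T110_ISCSI_MB_S):.1f} Gbps on the wire")
    for name, rate in NOMINAL_FC_MB_S.items():
        print(f"vs {name}: {ratio(T110_ISCSI_MB_S, rate):.1f}x")
```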

Chelsio thinks that the T110 TOE NIC is the first product that enables the use of 10GbE for fully converged HPCC networking, including data movement, IPC and storage communications. "The proven throughput, latency and scalability attributes of our adapters truly deliver network convergence for HPCC environments," says Chelsio's Naghshineh.

A range of high-end HPCC users in the research, academic, and commercial arenas are deploying Chelsio T110 for converged HPCC networking applications.

"Our testing shows that the 10-Gigabit Ethernet T110 adapter card simultaneously delivers high throughput and low latency, while keeping CPU utilization low by using their TCP offload engine (TOE)," said Wu Feng, team leader of research & development in Advanced Network Technologies (RADIANT) at Los Alamos National Laboratory. "We have also tested the T110 card with respect to scalability and found that the card easily supports hundreds of simultaneous connections with virtually no impact on the aggregate throughput."

Chelsio views its position as the only vendor offering a 10GbE TOE NIC and the proven nature of its technology as key differentiators for mission-critical HPC environments. Its T110 NIC enabled the University of Tokyo to win the annual Bandwidth Challenge competition at the Supercomputing 2004 conference.

"Achieving the world record required unique capabilities available only with Chelsio's T110 protocol engine," said Dr. Kei Hiraki of the University of Tokyo. "Our achievement required very high-speed, reliable TCP data transfer which could not have been realized without the flexibility and reliability of the T110 TOE protocol engine."

Regarding future plans, Chelsio's priorities are to focus on integration and cost reduction to drive 10GbE adoption, while continuing to lead the industry in protocol offload technology innovation.


Saqib Jang is founder and principal at Margalla Communications, a Woodside, CA-based strategic and technical marketing consulting firm focused on storage and server networking. He can be contacted at saqibj@margallacomm.com.

