
Cluster Computing:
10GbE SERVER NETWORKING: READY FOR PRIMETIME?
by Saqib Jang, Margalla Communications
High-performance cluster computing (HPCC) applications rely more on network
performance than any other enterprise application category. Large HPC batch
applications can take hours or even days to run, so the network must deliver
consistently high performance and non-stop availability over long periods.
Designing Ethernet networks for cluster computing therefore involves selecting
network components, including server networking elements, that have the
performance and robustness needed for distributed applications to run
successfully. This article reviews server networking products that support the
latest generation of Ethernet technology, 10 Gigabit Ethernet (10GbE), and how
these products address the evolving requirements of HPCC.
A number of recent developments (including Microsoft's impending entry into
the HPCC space) illustrate the transition of HPCC from traditional
research/scientific computing to commercial computing and the expansion of the
HPC application domain. HPC cluster architectures now offer enterprises a high
return on investment and competitive advantage by solving compute-intensive
problems that previously could not be solved in a timely or cost-effective
way. Commercial applications such as data mining are successfully using HPCC
architectures. For example, oil companies use these clusters to perform
seismic evaluations, oil reservoir simulations, and other tasks, and the
financial services industry uses them for portfolio forecasting.
Bioinformatics and special effects rendering have also seen rapid adoption of
HPCC.
Enterprise IT organizations are also converging on HPCC as a uniform
architecture that can support the scalability and availability requirements of
both data processing and technical computing applications. Requirements such
as reliability, availability, and serviceability are also becoming important
for all customers. Cluster architectures provide these features at a fraction
of the cost of traditional approaches.
Networking: Foundation for HPCC Cluster ROI
The first key requirement for a high-ROI HPCC infrastructure is the ability
to easily move and process very large datasets at high I/O rates. HPC clusters
typically use switched Gigabit Ethernet for client-to-server data movement.
Complicating this task is the fact that, while disk drive capacity doubles
every 9 months, drive performance lags behind gains in CPU performance. As a
result, keeping data flowing to ever-faster server CPUs has become a critical
storage networking requirement for an efficient and balanced HPC cluster
infrastructure.
HPCC environments commonly deploy a SAN switching fabric, such as a Fibre
Channel switch, or a Gigabit Ethernet-based NAS environment, either of which
gives each cluster node high-performance access to a shared storage pool.
Second, HPCC cluster design should provide a network infrastructure that
delivers predictable multi-gigabit bandwidth and low latency for cluster
inter-process communication (IPC). Gigabit Ethernet supports data transfer
rates of roughly 800 Mbps and latencies on the order of 60 microseconds, which
makes it unsuitable for IPC except in low-end cluster configurations.
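To see why latency rather than bandwidth decides IPC suitability, consider a simple first-order model in which a transfer takes the one-way latency plus the payload's serialization time. The Python sketch below plugs in the roughly 800 Mbps and 60 microsecond Gigabit Ethernet figures quoted above; the model itself is an assumption that ignores protocol overhead, congestion, and pipelining.

```python
def transfer_time_s(size_bytes: float, bandwidth_bps: float, latency_s: float) -> float:
    """First-order model (assumption): latency plus serialization time."""
    return latency_s + (size_bytes * 8) / bandwidth_bps

GBE_BW_BPS = 800e6  # ~800 Mbps effective Gigabit Ethernet throughput (from the text)
GBE_LAT_S = 60e-6   # ~60 microseconds of latency (from the text)

# A small IPC message: the 60 us latency dwarfs the 0.64 us spent on the wire.
small_msg = transfer_time_s(64, GBE_BW_BPS, GBE_LAT_S)
# A 1 GB dataset: serialization time dominates and latency is negligible.
big_dataset = transfer_time_s(1e9, GBE_BW_BPS, GBE_LAT_S)

print(f"64-byte IPC message: {small_msg * 1e6:.2f} us")  # 60.64 us
print(f"1 GB dataset:        {big_dataset:.2f} s")       # 10.00 s
```

For tightly coupled IPC, where messages are small and frequent, nearly all of the cost is latency, which is why dedicated IPC fabrics are judged on sub-10-microsecond latency rather than on raw bandwidth.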
While Gigabit Ethernet bandwidth is limited by wire-speed capacity, most of
the latency incurred with Ethernet IPC comes from software protocol processing
on the servers. Another drawback of processing network I/O in software is high
CPU utilization during intense network activity, which can hurt the
performance of HPC applications that require high levels of concurrent
computation and IPC.
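One rough way to see the CPU cost of software protocol processing is to count the CPU cycles available per frame at line rate. The sketch below assumes a single 2.2 GHz core (the clock speed of the Opteron server in the Veritest benchmark described later), full-size 1500-byte frames, and no Ethernet preamble or inter-frame gap overhead; all of these are simplifying assumptions.

```python
def frames_per_second(line_rate_bps: float, frame_bytes: int = 1500) -> float:
    """Frames per second at a given bit rate (ignores preamble and inter-frame gap)."""
    return line_rate_bps / (frame_bytes * 8)

def cycles_per_frame(cpu_hz: float, line_rate_bps: float, frame_bytes: int = 1500) -> float:
    """CPU cycles available per frame if one core must keep up with the link."""
    return cpu_hz / frames_per_second(line_rate_bps, frame_bytes)

CPU_HZ = 2.2e9  # assumed single 2.2 GHz core

for rate_bps in (1e9, 10e9):
    print(f"{rate_bps / 1e9:>4.0f} Gbps: {frames_per_second(rate_bps):>9,.0f} frames/s, "
          f"{cycles_per_frame(CPU_HZ, rate_bps):>6,.0f} cycles/frame")
```

Moving from 1 to 10 Gbps cuts the per-frame budget tenfold, from about 26,400 to about 2,640 cycles, and the per-packet work of a software TCP/IP stack (checksums, interrupts, copies) has to fit within that budget.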
For this reason, cluster IPC networking, except in simple cluster
configurations, has traditionally required a separate, dedicated switching
fabric, such as InfiniBand, Myrinet, or Quadrics, that can deliver
multi-gigabit bandwidth and sub-10-microsecond latency.
Enter 10 Gigabit Ethernet: Promising Convergence of HPC Networking
The latest generation of Ethernet networking, 10 Gigabit Ethernet (10GbE), is
emerging as an excellent choice for HPC networking. 10GbE switching has the
potential to become a highly scalable, standards-based alternative both for
data movement and as an IPC fabric.
As more Gigabit Ethernet-based enterprise servers and applications migrate to
10GbE, the cost of 10GbE switch ports and server adapters is expected to drop
rapidly.
10GbE more than meets the bandwidth requirements for client-to-server data
movement and for a cluster IPC switching fabric. End-to-end latency for IPC
over 10 Gigabit Ethernet can also be expected to drop to sub-10-microsecond
levels with the addition of TCP/IP Offload Engines (TOEs), which execute
TCP/IP processing in NIC firmware/hardware rather than in kernel software.
10GbE TOEs are also expected to greatly reduce CPU utilization, freeing the
CPU for application processing and improving overall system performance.
10GbE server NICs also support iSCSI and sockets upper-layer interfaces,
strengthening the case for switched Ethernet as a single "converged" cluster
fabric that meets the needs of IP communications, IPC, and NAS/SAN storage
interconnect. A converged fabric allows high-performance clusters to be built
on a single switching fabric, in contrast to the more complex and costly
approach of using proprietary fabrics for IPC and storage.
The first 10GbE host adapters were introduced in the latter half of 2002. By
evolving to address the full range of Ethernet-based data center applications,
10GbE NICs are expected to reach sufficient volumes to ride cost reduction
curves similar to those previously observed for Fast Ethernet and GbE NICs.
The available 10GbE server NIC products from Chelsio, Intel, and S2io can be
classified into two broad categories based on their protocol offload
mechanism. 10GbE NICs from Intel and S2io take a stateless, partial offload
approach, supporting features such as TCP checksum computation and TCP
transmit segmentation, to deliver line-rate 10 Gbps bandwidth.
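The two stateless offloads named above are per-packet computations that need no knowledge of the connection, which is what makes them straightforward to move into NIC hardware. The Python sketch below shows the Internet checksum algorithm from RFC 1071 (the ones'-complement sum used by TCP) and the splitting of one large send into MSS-sized segments that transmit segmentation performs. It is simplified: a real TCP checksum also covers a pseudo-header and the TCP header, and the default 1460-byte MSS assumes a 1500-byte MTU with 20-byte IP and TCP headers.

```python
def internet_checksum(data: bytes) -> int:
    """RFC 1071 ones'-complement checksum over 16-bit big-endian words."""
    if len(data) % 2:
        data += b"\x00"                           # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]     # accumulate 16-bit words
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries back into 16 bits
    return ~total & 0xFFFF

def segment(payload: bytes, mss: int = 1460) -> list[bytes]:
    """Split one large send into MSS-sized payloads, as transmit segmentation does."""
    return [payload[i:i + mss] for i in range(0, len(payload), mss)]

# Example data from RFC 1071's numerical example.
sample = bytes([0x00, 0x01, 0xF2, 0x03, 0xF4, 0xF5, 0xF6, 0xF7])
print(hex(internet_checksum(sample)))  # 0x220d

segments = segment(b"x" * 4000)
print([len(s) for s in segments])      # [1460, 1460, 1080]
```

A receiver can verify a segment by running the same sum over the data plus the transmitted checksum and checking for a result of zero, which is equally easy to do in hardware.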
Chelsio, on the other hand, offers 10GbE TOE NICs that fully offload network
protocol processing from servers while maintaining state for all data
transfers. The full protocol offload approach goes beyond line-rate 10 Gbps
throughput to also provide latency and CPU utilization benefits.
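What separates a full TOE from stateless offload is that the NIC must keep per-connection state for every transfer it handles. The sketch below is a hypothetical, heavily simplified model of the kind of state involved (a small subset of a TCP control block, with field names borrowed from RFC 793). It is not Chelsio's design, and it omits retransmission timers, congestion control, out-of-order reassembly, and much else.

```python
from dataclasses import dataclass

SEQ_MOD = 2 ** 32  # TCP sequence numbers wrap modulo 2^32

@dataclass
class TcpConnState:
    """Hypothetical, simplified per-connection state a stateful offload engine tracks."""
    snd_una: int  # oldest unacknowledged sequence number
    snd_nxt: int  # next sequence number to send
    snd_wnd: int  # receive window advertised by the peer, in bytes
    rcv_nxt: int  # next sequence number expected from the peer

    def in_flight(self) -> int:
        return (self.snd_nxt - self.snd_una) % SEQ_MOD

    def can_send(self, nbytes: int) -> bool:
        return self.in_flight() + nbytes <= self.snd_wnd

    def on_send(self, nbytes: int) -> None:
        self.snd_nxt = (self.snd_nxt + nbytes) % SEQ_MOD

    def on_ack(self, ack: int, wnd: int) -> bool:
        # Accept only ACKs that advance snd_una without passing snd_nxt.
        advance = (ack - self.snd_una) % SEQ_MOD
        if 0 < advance <= self.in_flight():
            self.snd_una, self.snd_wnd = ack, wnd
            return True
        return False

    def on_data(self, seq: int, nbytes: int) -> bool:
        # Accept only in-order data; a real engine would also buffer and reassemble.
        if seq == self.rcv_nxt:
            self.rcv_nxt = (self.rcv_nxt + nbytes) % SEQ_MOD
            return True
        return False

# One entry per connection, keyed by (local IP, local port, remote IP, remote port).
connections: dict[tuple[str, int, str, int], TcpConnState] = {}
connections[("10.0.0.1", 5001, "10.0.0.2", 40000)] = TcpConnState(
    snd_una=1000, snd_nxt=1000, snd_wnd=65535, rcv_nxt=5000)
```

Because every offloaded connection needs an entry like this, the amount of connection state the NIC can hold is a natural capacity limit for a TOE, which is one reason connection scalability is worth testing.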
10GbE Stateless Offload: Delivering High-Performance, Standards-Based Networking for HPCC Data Movement
Intel announced its second-generation 10GbE server NIC, the Intel PRO/10GbE
Server Adapter, which supports the PCI-X 1.0 bus, in May 2004. Single-quantity
list pricing for the PRO/10GbE SR (for connectivity up to 300 meters) and LR
(for connectivity up to 10 kilometers) models is $4,770 and $7,995,
respectively.
While PRO/10GbE pricing is high by Gigabit NIC standards, Intel is seeing
growing use of the product in high-end HPCC deployments in research,
laboratory, and academic settings. PRO/10GbE NICs are being used to accelerate
the delivery of multi-gigabyte visual simulation and rendering data sets
produced by complex applications that model, analyze, and predict the
performance of drugs, particle physics, and environmental ecosystems. Such
applications can easily migrate from GbE to 10GbE and are relatively
price-insensitive.
"While mainstream use of 10GbE is still 1-2 years out, we're seeing a nice
uptick in production deployment of our 10GbE NICs," says Steve Rotz, Product
Line Marketing Manager for Intel's PRO/10GbE NIC products. "Our 10GbE products
are completely based on Intel technology, use industry standards and are
highly reliable."
In addition to end-user deployments within high-end HPCC environments, Intel
is actively marketing its 10GbE NICs to the server OEM community. "We're very
gratified that IBM has chosen to OEM our PRO/10GbE NICs," says Intel's Rotz.
Regarding the issue of reducing server network protocol processing overhead,
Intel has disclosed the general outlines of its "TCP acceleration" project
without discussing specific product plans. The project is designed to speed up
existing server TCP/IP protocol stacks (as opposed to fully offloading network
processing to a TOE NIC). Industry observers believe that Intel's TCP
acceleration approach requires changes to the CPU, memory controller, and
Ethernet controller chips, and are cautious about just how aggressively this
approach will be deployed.
S2io Inc., a designer of ASICs and maker of network adapter cards, offers the
Xframe 10GbE server NIC in short-reach and long-reach fiber versions, priced at
$4,990 and $6,450, respectively.
Like the Intel PRO/10GbE, Xframe provides stateless protocol offload, and its
initial target application is high-speed data movement in HPC clustering
environments.
As evidence of Xframe's success in the market, S2io points to recently
concluded OEM deals with HP, SGI, and Cray.
Looking ahead, S2io plans to soon announce a new product that doubles the
bandwidth available to Xframe-enabled 10GbE links. In addition, support for
the Remote Direct Memory Access over TCP/IP (RDMA/TCP) standard is planned for
2006. RDMA/TCP will benefit cluster IPC by conserving memory bandwidth and
reducing latency, since it eliminates the kernel interrupts and copies
otherwise needed to move message data between the network buffer pool and
application buffers.
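The memory-bandwidth savings can be estimated with simple arithmetic. In a conventional receive path, the NIC writes each incoming byte into a kernel buffer, and the CPU then reads it and writes it again into the application buffer; with RDMA-style direct placement, only the NIC's write remains. The sketch below counts those memory-bus crossings for a 10 Gbps stream. It ignores caches and the application's own later reads, so treat the numbers as illustrative estimates rather than measurements.

```python
def receive_memory_traffic_gb_s(line_rate_gbps: float, cpu_copies: int) -> float:
    """Memory-bus traffic (GB/s) needed to land received data in application buffers.

    Assumes one DMA write by the NIC plus one read and one write per CPU copy;
    cache effects and the application's later reads are ignored.
    """
    data_gb_s = line_rate_gbps / 8  # 10 Gbps -> 1.25 GB/s of payload
    return data_gb_s * (1 + 2 * cpu_copies)

copy_path = receive_memory_traffic_gb_s(10, cpu_copies=1)    # kernel buffer -> app buffer
direct_path = receive_memory_traffic_gb_s(10, cpu_copies=0)  # RDMA-style direct placement

print(f"with one copy:    {copy_path:.2f} GB/s")    # 3.75 GB/s
print(f"direct placement: {direct_path:.2f} GB/s")  # 1.25 GB/s
```

Under these assumptions, a single extra copy triples the memory traffic generated by a 10 Gbps receive stream, and that traffic competes with the application for the same memory bus.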
10GbE Full Protocol Offload: Delivering on the Promise of Converged Networking for HPC Clustering
Chelsio Communications, Inc., a developer of 10 Gigabit Ethernet ASIC-based
adapter cards with protocol acceleration technology, takes a stateful, full
protocol offload approach in its Terminator line of 10GbE NICs. Chelsio's
PCI-X compliant T110 TOE and N110 stateless offload NICs started shipping in
May 2004. Single-quantity pricing for the T110 ranges from $3,995 for the
short-reach version to $5,500 for the long-reach version, while N110 pricing
ranges from $2,495 for the short-reach version to $2,995 for the long-reach
version.
The T110 is the first server NIC to offer a silicon-based 10G TOE and, as
such, delivers demonstrable throughput, CPU utilization, and latency benefits.
It enables 10 Gigabit Ethernet to go beyond high-speed data movement and
become a contender as a cluster IPC fabric within HPC environments.
Recently, Veritest published a benchmark report (available via the Veritest
and Chelsio web sites) showing the Chelsio T110 transmitting standard
1500-byte Ethernet frames in a peer-to-peer configuration at 7.8 Gbps, with
less than 10 microseconds of user-level process-to-process latency and 50% CPU
utilization on a 2.2 GHz AMD Opteron-based server.
"Published benchmarks show that Chelsio Communications has delivered the first
10G Ethernet adapter card that simultaneously achieves high throughput, low
latency, and more importantly, low CPU utilization," says Kianoosh Naghshineh,
Chelsio's president and CEO. "Our T110 TOE NIC truly makes the ubiquitous
high-speed Ethernet feasible for HPCC interconnect applications."
The Chelsio T110 also implements iSCSI protocol acceleration and has been
demonstrated to offer performance advantages for SAN and NAS applications as
well. Veritest testing has shown that the T110 can deliver 670K I/Os per
second (IOPS) and over 800 MB/s of throughput running in iSCSI target mode,
significantly more than 2 Gb or 4 Gb Fibre Channel technology delivers. TCP
protocol processing overhead is also a significant concern for high-speed
access to NAS storage, and Chelsio's T110 TOE NIC is beneficial here as well.
The T110's high NAS and SAN performance relative to FC was behind a recent OEM
win (which Chelsio will be announcing later in 2005).
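Putting the iSCSI figures on a common scale helps with the Fibre Channel comparison. The sketch below converts the reported 800 MB/s to Gbps and compares it with the commonly cited nominal per-direction data rates of 2 Gb and 4 Gb Fibre Channel (roughly 200 MB/s and 400 MB/s). It also applies throughput = IOPS × I/O size to show that 670K IOPS fits within a single 10 Gbps link only if the average I/O is small. The I/O sizes used in the Veritest runs are not given here, and the sketch assumes the data flows in one direction and ignores protocol overhead.

```python
T110_ISCSI_MB_S = 800      # "over 800 MB/s" in iSCSI target mode (from the text)
T110_ISCSI_IOPS = 670_000  # 670K IOPS (from the text)
FC_NOMINAL_MB_S = {"2 Gb FC": 200, "4 Gb FC": 400}  # nominal per-direction rates
LINK_BYTES_S = 10e9 / 8    # one 10 Gbps link, one direction, no overhead (assumption)

def mb_s_to_gbps(mb_s: float) -> float:
    return mb_s * 8 / 1000

print(f"800 MB/s is about {mb_s_to_gbps(T110_ISCSI_MB_S):.1f} Gbps")
for name, rate in FC_NOMINAL_MB_S.items():
    print(f"{name}: {rate} MB/s, so 800 MB/s is {T110_ISCSI_MB_S / rate:.0f}x")

# Largest average I/O size at which 670K IOPS still fits on the link.
max_io_bytes = LINK_BYTES_S / T110_ISCSI_IOPS
print(f"670K IOPS fits on 10 Gbps only if I/Os average under {max_io_bytes:.0f} bytes")
```

Under these assumptions, 800 MB/s is about 6.4 Gbps, two to four times the nominal FC rates, while the 670K IOPS figure must reflect small I/Os of under roughly 1.9 KB each.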
Chelsio thinks that the T110 TOE NIC is the first product that enables the use
of 10GbE for fully converged HPCC networking, including data movement, IPC and
storage communications. "The proven throughput, latency and scalability
attributes of our adapters truly deliver network convergence for HPCC
environments," says Chelsio's Naghshineh.
A range of high-end HPCC users in the research, academic, and commercial
arenas are deploying Chelsio T110 for converged HPCC networking applications.
"Our testing shows that the 10-Gigabit Ethernet T110 adapter card
simultaneously delivers high throughput and low latency, while keeping CPU
utilization low by using their TCP offload engine (TOE)," said Wu Feng, team
leader of research & development in Advanced Network Technologies (RADIANT) at
Los Alamos National Laboratory. "We have also tested the T110 card with
respect to scalability and found that the card easily supports hundreds of
simultaneous connections with virtually no impact on the aggregate
throughput."
Chelsio views its position as the only vendor offering a 10GbE TOE NIC and the
proven nature of its technology as key differentiators for mission-critical
HPC environments. Its T110 NIC enabled the University of Tokyo to win the
annual Bandwidth Challenge competition at the Supercomputing 2004 conference.
"Achieving the world record required unique capabilities available only with
Chelsio's T110 protocol engine," said Dr. Kei Hiraki of the University of
Tokyo. "Our achievement required very high-speed, reliable TCP data transfer
which could not have been realized without the flexibility and reliability of
the T110 TOE protocol engine."
Regarding future plans, Chelsio's priorities are to focus on integration and
cost reduction to drive 10GbE adoption, while continuing to lead the industry
in protocol offload technology innovation.
Saqib Jang is founder and principal at Margalla Communications, a Woodside,
CA-based strategic and technical marketing consulting firm focused on storage
and server networking. He can be contacted at saqibj@margallacomm.com.