New York, N.Y. -- When researchers at Celera Genomics, the Sanger Centre,
the Whitehead Institute, and the National Institutes of Health announced that
they had collaborated in completing the map of the human genome, they were
highlighting a monumental scientific achievement made possible by high
performance computing technology from Compaq Computer Corporation.
The supercomputers used by Celera Genomics, the Sanger Centre, and the
Whitehead Institute were Compaq AlphaServers running Tru64 UNIX and TruCluster
software.
"This is the first time in history that the human genetic code has been
assembled in a linear fashion," J. Craig Venter, Celera's Chief Executive
Officer, told USA Today. Celera had to assemble the 3.2 billion base pairs in
their correct order, a computational challenge among the largest ever
attempted. During the assembly process, Celera deployed more than 600 Alpha
processors from Compaq capable of nearly a trillion operations per second.
The final assembly computations were run on Compaq's new AlphaServer GS160
because the algorithms and data required 64 gigabytes of shared memory to run
successfully.
Since the start of the human genome project in the early 1990's, Compaq has
been providing tools to handle the staggering amount of data and computing
power necessary to decipher the 3.2 billion "base pairs" that make up the
genome - all the genes and related DNA.
Established by Wellcome Trust and the British Medical Research Council, the
Sanger Centre in the United Kingdom needed a computing infrastructure that
could meet the challenge of mapping and sequencing the human genome. According
to Phil Butcher, Sanger's head of Information Technology, they based their
choice on three steadfast design principles: scalability, adaptability and
resiliency.
Sanger's initial configuration consisted of 160 Compaq Alpha workstations,
four Compaq AlphaServer 1200 systems, and a small PC cluster running BLAST
(basic local alignment search tool) to support public search access of their
genomic databases over the Internet.
The Centre has continued to supplement their computing infrastructure,
eventually acquiring approximately 250 Compaq AlphaServers and workstations
running Tru64 UNIX. The Centre also employs a Compaq StorageWorks RAID system
with four TeraBytes of disk space, a 300 GB Network Appliances RAID subsystem,
and 48 Compaq Deskpro PCs.
Since 1990, Compaq has been at the forefront of developing and providing
high performance computer architectures that meet the needs of the
pharmaceutical, biotechnology and medical companies. "Today, it is
increasingly difficult to separate the advances in biotechnology from advances
in high performance computing," explained Ben Rosen, Chairman of the Board of
Compaq during his keynote address at BIO '99. "In fact, some leading
scientists believe that high-end computing is the future of biology and
medicine; it will take increasingly powerful computers and software to
gather, store, analyze, model and distribute information."
In 1998, Compaq was selected by Celera Genomics as their IT partner. Compaq
designed and equipped Celera's data center, eventually installing and
interconnecting nearly 700 CPUs and 70 TeraBytes of storage.
In 1999, Compaq created a Bioinformatics Expertise Center in Marlboro,
Massachusetts to better support customers and business partners in the
industry. Compaq's Cambridge, Massachusetts Research Laboratory also began a
focus on bioinformatics, contributing to the optimization of applications
performance and the development of data mining algorithms for genetic data.
In 1999, Compaq was selected by MIT's Whitehead Institute for Genomics
Research to supply the IT infrastructure for their human genome efforts.
Whitehead operates the largest public sequencing center in the United States,
and is one of four key centers funded by NIH to complete the draft of the
human genome. The Institute also relies on Compaq AlphaServer ES40s and Compaq
StorageWorks to manage and analyze their genomic data.
Compaq's most recent contribution to the Human Genome Project is a cluster
of AlphaServer ES40s with 100 CPUs and a TeraByte of storage located at
Compaq's Enterprise System Lab in Littleton, Massachusetts that is being made
available to the research institutions to complete the annotation of the human
genome.
San Diego, CA -- Recent milestones in the area of mapping and annotating the
human genome have created a groundswell of interest in this worldwide topic.
Compaq's High Performance Technical Computing (HPTC) group has a long history
in providing computing platforms for this area of scientific research, and
recently, it was Compaq's Alpha systems that served as the computation engines
for a number of research organizations, allowing scientists to make their
recent breakthroughs.
HPCwire offers our readers this interview with Ty Rabe, Director of Compaq's
High Performance Technical Computing Solutions group, to provide a better
perspective on the computation side of the human genome project.
Q. How has Compaq contributed to the Human Genome Project?
A. Compaq is the primary supplier of systems to the bioinformatics industry.
Most of the leading commercial and public genomic centers, such as Celera
Genomics, Incyte Pharmaceuticals, the Sanger Centre, EBI, and the Whitehead
Institute use Compaq Alpha systems. Our major pharmaceutical customers include
Genentech, Novartis, SmithKline Beecham, and AstraZeneca.
Q. What is a typical configuration?
A. There is no typical configuration. The needs of each institution or
facility differ. For the most part, organizations start out with a couple of
systems then continue to add more as needs arise.
For instance, when the Sanger Centre, one of the world's premier
institutions for genomic research, began working on the Human Genome Project
in 1993, they relied on a few workstations. They soon realized that the
workstations couldn't handle the amount of data, need for storage, network
connectivity or tremendous computing power necessary to sequence the human
genome.
Their initial configuration consisted of 160 Compaq Alpha workstations, four
Compaq AlphaServer 1200 systems, and a small PC cluster using BLAST (basic
local alignment search tool) to support public search access of their genomic
databases over the Internet. The Centre has continued to supplement their
computing infrastructure, eventually acquiring approximately 250 Compaq Alpha
systems running on Tru64 UNIX software along with numerous Compaq AlphaServer
systems, a Compaq StorageWorks RAID system with 3.5 terabytes of disk space
and a 300 GB Network Appliances RAID subsystem, and 48 Compaq Deskpro PCs.
Q. When interviewed by CNN, Eric Lander, Director of the Whitehead
Institute/MIT Center for Genome Research, commented "There's no need for a
supercomputer to assemble the human genome sequence." Is that true?
A. Theoretically, yes, if you mean a supercomputer in the traditional sense.
However, to reach their goals and for efficiency, research and
pharmaceutical companies do use clusters of systems that effectively act as
supercomputers. High-speed computers are necessary to analyze hundreds of
terabytes of raw sequence data and correctly order over 3 billion pairs of
bases.
Currently, three institutions affiliated with the public Human Genome
Project are using a 100-CPU "biocluster" assembled by Compaq HPTC to assist
with the initial steps in annotating the human genome, that is, identifying
where the genes are located in the human chromosomes.
Q. Who are the institutions?
A. They are the University of California at Santa Cruz, the Whitehead
Institute/MIT Center for Genome Research and Compaq's Cambridge Research Lab
(CRL).
Q. What are they doing?
A. The University of California at Santa Cruz is responsible for laying out
and arranging the DNA pieces into the best version of the sequence. The Whitehead
Institute/MIT Center for Genome Research will use the "bio-cluster" to catalog
the repetitive DNA in the genome. There are approximately 100,000 genes in the
human genome. The rest is comprised of repetitive, although conceivably just
as important, code whose purpose has yet to be determined. Finally, the
Cambridge Research Lab will find the exact location of the genes.
Q. Is Compaq charging for the use of the "biocluster?"
A. No. Compaq is dedicated to assisting the public researchers in reaching
their goal to complete the sequencing and annotation. The public project came
to Compaq and requested assistance because they didn't have the computing
capacity to complete their work on schedule. Unlike commercial genomics labs,
the public sector has limited funds. Compaq agreed to help by providing them
with the tools and storage to complete the annotation later this year.
Q. Where is the "biocluster" located?
A. It's at Compaq's Enterprise Systems Lab in Littleton, Mass.
Q. What are the components of the system?
A. The "biocluster" comprises 25 ES40 SMP nodes, each with 54 gigabytes of
local storage and 4 Alpha CPUs (EV67, 667 MHz), networked together using
10/100 Ethernet. Twenty-four of the nodes have 4 gigabytes of RAM and one has
16 gigabytes of RAM. In addition, the system has a central file server with 1
terabyte of secondary storage. A standalone AlphaServer ES40 system is also
available for testing scripts and any new user code before running on the main
cluster.
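As a rough cross-check of the figures quoted in this answer, the cluster totals can be tallied with a short script. This is only a sketch: the Node class and the variable names are illustrative and not part of any Compaq tool, and the per-node numbers are simply those given above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    cpus: int           # Alpha EV67 CPUs in one ES40 node
    ram_gb: int         # main memory, in gigabytes
    local_disk_gb: int  # local storage, in gigabytes


# 24 nodes with 4 GB of RAM and one node with 16 GB, as described above.
nodes = [Node(4, 4, 54)] * 24 + [Node(4, 16, 54)]

total_cpus = sum(n.cpus for n in nodes)
total_ram_gb = sum(n.ram_gb for n in nodes)
total_local_disk_gb = sum(n.local_disk_gb for n in nodes)

print(f"nodes:      {len(nodes)}")
print(f"CPUs:       {total_cpus}")
print(f"RAM:        {total_ram_gb} GB")
print(f"local disk: {total_local_disk_gb} GB")
```

The CPU total matches the 100-CPU figure quoted for the cluster. The local disk total is separate from the 1 terabyte held on the central file server.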
Q. What type of software and databases are running on the system?
A. The software and databases that are being used in conjunction with the
"biocluster" include public domain software from the University of Virginia
(FASTA), Washington University (BLAST, HMMER), Southwest Parallel Software
(Cross Match, SSearch), NCBI (BLAST), Whitehead/MIT (Genscan), and, from other
institutions, RepeatMasker, ClustalW, readseq, and MySQL.
The databases are GenBank, SWISS-PROT and GenPept. CPLEX and Platform
Computing are the commercial codes being used.
Q. Why is the Cambridge Research Lab (CRL) involved in using the
"biocluster" to complete the annotation?
A. The Cambridge Research Lab, part of Compaq, is developing algorithms and
systems to support large-scale genomic analysis. They're currently working on
an efficient and high accuracy gene discovery software pipeline that will
allow them to gain an insight into the efficiency of Compaq platforms to
support genomic analysis, as well as to develop new computational ideas to
improve the quality and speed of this analysis.
The computational tools that are under development include gene detection,
functional genomics and comparative genome analysis.
Q. Did CRL just begin conducting genome analysis?
A. No. Dr. Simon Kasif, one of the chief researchers with CRL, was initially
with The Institute for Genomic Research where he worked with Dr. Craig Venter,
the CEO of Celera Genomics. Dr. Kasif helped design and build Glimmer, one of
the most widely used systems for microbial genome analysis. He's currently
developing an active research program aimed at developing better systems for
computational gene discovery. These discoveries will be shared with partners.
CRL's objective is to follow standard academic methodology and release
procedures for the internally developed software to the general public.
Q. What other projects will the "biocluster" be used for?
A. Initially, it will be used to conduct the annotation, with other genomic
projects to follow.
Q. Are there any requirements for using the "biocluster?"
A. The use of the "biocluster" is determined by the chief researchers of the
public Human Genome Project (HGP). Once an organization has been approved to
use the "biocluster", they're required to test their scripts and any new code
on a standalone AlphaServer ES40 system before running it on the main cluster.
Q. How long has Compaq been involved in bioinformatics?
A. Since 1990, Compaq has been at the forefront of developing and providing
high performance computer architectures that meet the needs of the
pharmaceutical, biotechnology and medical companies. Ben Rosen, chairman of
the board of Compaq, discussed Compaq's commitment to bioinformatics during his
keynote address at Bio '99 in Seattle, explaining, "Today, it is increasingly
difficult to separate the advances in biotechnology from advances in high
performance computing. In fact, some leading scientists believe that high-end
computing is the future of biology and medicine; it will take increasingly
powerful computers and software to gather, store, analyze, model and
distribute information."
Compaq enhanced their commitment to the biotechnology industry and
bioinformatics through the acquisition of Digital, which introduced the first
minicomputer, the Digital PDP-8, thirty years ago. The PDP-8 subsequently became
the mainstay of research laboratories around the world.
Q. Dr. Francis Collins, Director of the National Human Genome Research Institute,
commented that the Human Genome Project is more important than putting a man
on the moon. How does bioinformatics compare?
A. The importance of each accomplishment can be debated. However, the
computational requirements are far greater for the Human Genome Project. It is
estimated that the doubling rate for genetic databases is six to eight months.
Research organizations such as Celera Genomics and the Sanger Centre are
already managing multiple terabytes of data, larger than the Library of
Congress in size. This data will grow exponentially over the next few years.
Undoubtedly, the storage and computing resources required for the Human
Genome Project are millions of times greater than used to land a man on the
moon!
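To make the doubling estimate concrete, growth at a fixed doubling time T can be written as size(t) = size(0) * 2^(t/T). The sketch below applies that formula; the 1 terabyte starting size and the 24-month horizon are illustrative assumptions, not figures from the interview.

```python
def projected_size(initial_tb: float, months: float, doubling_months: float) -> float:
    """Size after `months` if the data doubles every `doubling_months` months."""
    return initial_tb * 2 ** (months / doubling_months)


# Hypothetical starting point of 1 TB, projected 24 months ahead for the
# two ends of the quoted six-to-eight-month doubling range.
for doubling in (6, 8):
    size_tb = projected_size(1.0, 24, doubling)
    print(f"doubling every {doubling} months: {size_tb:.0f} TB after 24 months")
```

Under these assumptions a database grows between 8-fold and 16-fold in two years, which is why storage planning figures so prominently in the answers that follow.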
Q. If the crux of the sequencing and annotation is completed, why the
continuing need for high performance systems?
A. The sequence or code of DNA is comparable to a parts list for an airplane
or sophisticated piece of machinery. The parts list doesn't tell how the
airplane or machine works. The same is true for DNA. Researchers now have the
lengthy process of determining how each gene works. The demand for
computational power will significantly increase as they develop new
diagnostics, drug therapies and new strategies and methods for identifying
disease genes.
In addition, agricultural/chemical companies are investing in high
performance systems to develop new strains of seeds or herbicide resistant
crops.
Q. How is the human genome similar to crops?
A. A genome is a code. Just as there are similarities between animals and
man, there are similarities in the DNA of plants and insects; in fact, in all
living organisms. The first genome that was completed was for phi-X in 1977.
It had approximately 5,000 base pairs, compared to man's 3.2 billion.
Compaq systems are used in a wide variety of agricultural, chemical and
pharmaceutical companies analyzing and simulating living organisms.
Q. How is Compaq involved in pharmaceuticals?
A. Compaq has provided solutions to SmithKline Beecham (SKB), one of the
leading pharmaceutical companies, to develop new drugs based on genomics. A
longtime Compaq Alpha customer, they have invested significantly in Compaq
systems for their US bioinformatics division alone. They currently have
installed a TruCluster solution containing more than 100 Alpha CPUs for DNA
research.
Genentech, Inc., a biotechnology pioneer, relies on Compaq AlphaServer
systems as a key element in their IT infrastructure running everything from
e-mail to high performance bioinformatics, protein and molecular biology
applications. Both AstraZeneca Pharmaceuticals and Incyte Genomics chose
AlphaServer and StorageWorks systems to handle the complexity and volume of
genomic data and to increase customer service and access.
Q. What are some of the IT challenges bioinformatics customers face?
A. The key challenge is managing the vast amount of data in a 24x7
production environment. Celera Genomics has 700 interconnected Compaq Alpha
processors. Each one can perform more than 250 billion sequence comparisons
per hour. Every day, Celera processes 300,000 genomic fragments or 150 million
base pairs. Celera currently has 70 terabytes of genomic data on line.
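The daily throughput figures above imply an average fragment length, and they give a feel for how long one genome's worth of raw bases takes to process. The sketch below works through that arithmetic; the 3.2 billion base pair genome size is the figure quoted elsewhere in this article, and the variable names are illustrative.

```python
fragments_per_day = 300_000
base_pairs_per_day = 150_000_000
genome_base_pairs = 3_200_000_000  # genome size quoted elsewhere in this article

# Average length of one processed fragment.
avg_fragment_length = base_pairs_per_day / fragments_per_day

# Days needed to process as many raw bases as there are in one genome.
days_per_genome_pass = genome_base_pairs / base_pairs_per_day

print(f"average fragment length: {avg_fragment_length:.0f} base pairs")
print(f"days per genome's worth of bases: {days_per_genome_pass:.1f}")
```

This ignores the redundant coverage that shotgun sequencing requires, so it understates the real elapsed time for a full assembly.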
In addition to data management, genomic researchers need computational
performance. As Celera says, "Speed matters - Discovery can't wait." They
chose Compaq because we had the processors, 64-bit operating system, and very
large memory configurations that ran their applications 10 times faster than
our nearest competitor.
Q. Have other organizations run benchmarks or comparison tests on Compaq
systems?
A. Yes, in a genetic sequence analysis benchmark run by the University of
Nebraska Medical Center, a Compaq AlphaServer ES40 outperformed the
competition by a factor of two or more in all the benchmark tests.
When the memory needs of The Institute for Genomic Research (TIGR) began to
surpass the capabilities of their 2 Gigabyte 32-bit machines, they asked Compaq
to port the TIGR Assembler code to a 64-bit AlphaServer 8400. They ran the code
through a UNIX translator with no modifications. The assembly code with a
specified data set ran for 11 days on TIGR's 32-bit Sun machine without
completing. On the AlphaServer, it completed the task in 17 hours.
Compaq's Cambridge Research Labs later optimized the code, reducing the
computing time to just 10 hours.
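The run times quoted above translate into speedup factors as follows. This is a simple worked calculation rather than a benchmark: because the 11-day run on the 32-bit machine never finished, the figures derived from it are lower bounds.

```python
HOURS_PER_DAY = 24

# The 32-bit run was still unfinished after 11 days, so this is a lower bound.
sun_hours_incomplete = 11 * HOURS_PER_DAY
alpha_hours = 17            # unmodified code on the AlphaServer
alpha_optimized_hours = 10  # after optimization by the Cambridge Research Lab

min_speedup_port = sun_hours_incomplete / alpha_hours
speedup_from_tuning = alpha_hours / alpha_optimized_hours
min_speedup_total = sun_hours_incomplete / alpha_optimized_hours

print(f"port alone:       at least {min_speedup_port:.1f}x")
print(f"tuning alone:     {speedup_from_tuning:.1f}x")
print(f"port plus tuning: at least {min_speedup_total:.1f}x")
```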
Q. Is there just one type of AlphaServer?
A. Compaq offers a complete choice of high performance technical computing
solutions from the low end to the high end, Tru64 UNIX or Linux. The
AlphaServer DS, ES, GS, and SC systems provide the performance of the latest
Alpha microprocessors in scalable configurations from the low-cost, single
processor AlphaServer DS10 to high-end, switched SMP systems such as the
AlphaServer SC series supercomputer that supports up to 128 processors.
The new AlphaServer GS series supports true SMP programming across the full
32-cpu environment along with single cache-coherent shared memory of up to 256
Gigabytes and breakthrough system management, serviceability and availability
features.
Q. What does the AlphaServer SC series supercomputer offer?
A. Compaq's first true supercomputer, the AlphaServer SC series was
developed to meet the challenging ASCI requirements, which called for
multi-Teraflops performance, superior cost/performance and high availability
with proven high-volume components. With the industry's most advanced,
high-speed interconnect technology, the AlphaServer SC supercomputer systems
are able to connect 64 to 128 AlphaServer symmetric multiprocessor nodes,
managed as a single system.
Q. Where are AlphaServer SC supercomputers installed?
A. Last year, Compaq announced at Supercomputing '99 the availability of
the AlphaServer SC; the first one shipped in September 1999 to Lawrence Livermore
National Laboratory. Currently, the AlphaServer SC is #15 on the Top500 List
with the following performance:
- Processors: 512 x Alpha EV67 (667MHz)
- Rmax: 607.6 GFLOPs
- Rpeak: 683 GFLOPs
- Nmax: 200,000
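The Rpeak value in the list above is consistent with the processor count and clock rate, and together with Rmax it gives the Linpack efficiency. The sketch below shows the arithmetic; the figure of two floating-point operations per cycle per processor is an assumption made here for illustration, not a number stated in the interview.

```python
processors = 512
clock_ghz = 0.667
flops_per_cycle = 2  # assumption: one add and one multiply per cycle per CPU

# Theoretical peak: every processor retiring its maximum every cycle.
rpeak_gflops = processors * clock_ghz * flops_per_cycle

rmax_gflops = 607.6  # measured Linpack result from the list above
efficiency = rmax_gflops / rpeak_gflops

print(f"Rpeak: {rpeak_gflops:.0f} GFLOPs")
print(f"Linpack efficiency: {efficiency:.1%}")
```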
Q. How many AlphaServers, outside of the SC series, have been sold?
A. The International Data Corporation (IDC) has determined that Compaq
AlphaServer systems have moved into second place in market share in High
Performance Technical Computing. Our total HPTC business including systems,
storage, and services is well in excess of $2 billion per year.
Q. Storage seems to be a concern for biotechnology researchers. What types
of storage does Compaq offer?
A. Compaq StorageWorks systems are valued for their expandability and redundant
components (dual heads, controllers and power supplies) that ensure no single
point of failure. StorageWorks systems are also scalable and flexible. In
addition to Celera's 70 terabyte system, the Sanger Centre uses a StorageWorks
RAID system that makes up the bulk (3.5 terabytes) of their disk storage,
along with a 300 GB Network Appliances RAID subsystem.
The Institute for Genomic Research (TIGR) uses a StorageWorks Enterprise
Storage Array (ESA) 1000 to handle their ever-increasing data storage
requirements, including the backup of all sequence chromatogram files and text
data.
Q. What about software?
A. Compaq has collaborations and joint development efforts with the leading
academic and commercial software developers for bioinformatics. Working
closely with original code developers, Compaq engineers optimize their code
for Alpha systems. Currently, there are over sixty public domain codes
available on the Alpha platform and many more are optimized for Alpha.
Oracle provides the leading database for bioinformatics, and is optimized
for performance on Alpha and Compaq's Tru64 UNIX.
Celera Genomics, Rockville, MD - "Why are we using Compaq's Tru64 UNIX on
Alpha? Performance. Large memory configurations. The only platform that can
deliver the performance. In our benchmark tests, the next vendor was
literally an order of magnitude slower with our application." Marshall
Peterson, director of infrastructure technology.
Sanger Centre, Cambridge, England - "Unlike the smaller RAID system, which
is bounded and has single components, the Compaq StorageWorks systems are
expandable and have dual heads, controllers, and power supplies, so there is
no single point of failure. The 64-bit technology driving Compaq Alpha and
Compaq Tru64 UNIX systems is speeding sequencing of the human genome. Without
access to a great deal of the very latest computing resources, we would not
even be able to contemplate reaching our targets." Phil Butcher, head of
information technology.
The Institute for Genomic Research (TIGR), Rockville, MD. - "The Alpha
systems have improved our productivity dramatically. Our researchers can get
their work done a lot faster. They have the flexibility to repeat an
experiment or analysis on a given day that they used to have to wait a week to
do." Bruce Vincent, manager of information technology.