COMPAQ TECHNOLOGY ENABLES COMPLETION OF HUMAN GENOME

New York, N.Y. -- When researchers at Celera Genomics, the Sanger Centre, the Whitehead Institute, and the National Institutes of Health announced that they had collaborated in completing the map of the human genome, they were highlighting a monumental scientific achievement made possible by high performance computing technology from Compaq Computer Corporation.

The supercomputers used by Celera Genomics, the Sanger Centre, and the Whitehead Institute were Compaq AlphaServers running Tru64 UNIX and TruCluster software.

"This is the first time in history that the human genetic code has been assembled in a linear fashion," J. Craig Venter, Celera's Chief Executive Officer, told USA Today. Celera had to assemble the 3.2 billion base pairs in their correct order, a computational challenge among the largest ever attempted. During the assembly process, Celera deployed more than 600 Alpha processors from Compaq capable of nearly a trillion operations per second. The final assembly computations were run on Compaq's new AlphaServer GS160 because the algorithms and data required 64 gigabytes of shared memory to run successfully.

Since the start of the Human Genome Project in the early 1990s, Compaq has been providing the tools to handle the staggering amount of data, and to supply the computing power, necessary to decipher the 3.2 billion "base pairs" that make up the genome - all the genes and related DNA.

Established by the Wellcome Trust and the British Medical Research Council, the Sanger Centre in the United Kingdom needed a computing infrastructure that could meet the challenge of mapping and sequencing the human genome. According to Phil Butcher, Sanger's head of Information Technology, they based their choice on three steadfast design principles: scalability, adaptability and resiliency.

Sanger's initial configuration consisted of 160 Compaq Alpha workstations, four Compaq AlphaServer 1200 systems, and a small PC cluster running BLAST (basic local alignment search tool) to support public search access of their genomic databases over the Internet.

The Centre has continued to supplement their computing infrastructure, eventually acquiring approximately 250 Compaq AlphaServers and workstations running Tru64 UNIX. The Centre also employs a Compaq StorageWorks RAID system with four TeraBytes of disk space, a 300 GB Network Appliances RAID subsystem, and 48 Compaq Deskpro PCs.

Since 1990, Compaq has been at the forefront of developing and providing high performance computer architectures that meet the needs of pharmaceutical, biotechnology and medical companies. "Today, it is increasingly difficult to separate the advances in biotechnology from advances in high performance computing," explained Ben Rosen, Chairman of the Board of Compaq, during his keynote address at BIO '99. "In fact, some leading scientists believe that high-end computing is the future of biology and medicine; it will take increasingly powerful computers and software to gather, store, analyze, model and distribute information."

In 1998, Compaq was selected by Celera Genomics as their IT partner. Compaq designed and equipped Celera's data center, eventually installing and interconnecting nearly 700 CPUs and 70 TeraBytes of storage.

In 1999, Compaq created a Bioinformatics Expertise Center in Marlboro, Massachusetts to better support customers and business partners in the industry. Compaq's Cambridge, Massachusetts Research Laboratory also began a focus on bioinformatics, contributing to the optimization of applications performance and the development of data mining algorithms for genetic data.

In 1999, Compaq was selected by MIT's Whitehead Institute for Genomics Research to supply the IT infrastructure for their human genome efforts. Whitehead operates the largest public sequencing center in the United States, and is one of four key centers funded by NIH to complete the draft of the human genome. The Institute also relies on Compaq AlphaServer ES40s and Compaq StorageWorks to manage and analyze their genomic data.

Compaq's most recent contribution to the Human Genome Project is a cluster of AlphaServer ES40s with 100 CPUs and a TeraByte of storage located at Compaq's Enterprise System Lab in Littleton, Massachusetts that is being made available to the research institutions to complete the annotation of the human genome.


HPC & THE GENOME PROJECT: AN INTERVIEW WITH TY RABE
by Alan Beck, editor in chief

San Diego, CA -- Recent milestones in the area of mapping and annotating the human genome have created a groundswell of interest in this worldwide topic. Compaq's High Performance Technical Computing (HPTC) group has a long history in providing computing platforms for this area of scientific research, and recently, it was Compaq's Alpha systems that served as the computation engines for a number of research organizations, allowing scientists to make their recent breakthroughs.

HPCwire offers our readers this interview with Ty Rabe, Director of Compaq's High Performance Technical Computing Solutions group, to provide a better perspective on the computation side of the human genome project.

Q. How has Compaq contributed to the Human Genome Project?

A. Compaq is the primary supplier of systems to the bioinformatics industry. Most of the leading commercial and public genomic centers, such as Celera Genomics, Incyte Pharmaceuticals, the Sanger Centre, EBI, and the Whitehead Institute use Compaq Alpha systems. Our major pharmaceutical customers include Genentech, Novartis, SmithKline Beecham, and AstraZeneca.

Q. What is a typical configuration?

A. There is no typical configuration. The needs of each institution or facility differ. For the most part, organizations start out with a couple of systems then continue to add more as needs arise.

For instance, when the Sanger Centre, one of the world's premier institutions for genomic research, began working on the Human Genome Project in 1993, they relied on a few workstations. They soon realized that the workstations couldn't handle the amount of data, need for storage, network connectivity or tremendous computing power necessary to sequence the human genome.

Their initial configuration consisted of 160 Compaq Alpha workstations, four Compaq AlphaServer 1200 systems, and a small PC cluster using BLAST (basic local alignment search tool) to support public search access of their genomic databases over the Internet. The Centre has continued to supplement their computing infrastructure, eventually acquiring approximately 250 Compaq Alpha systems running Tru64 UNIX software, along with numerous Compaq AlphaServer systems, a Compaq StorageWorks RAID system with 3.5 terabytes of disk space, a 300 GB Network Appliances RAID subsystem, and 48 Compaq Deskpro PCs.

Q. When interviewed by CNN, Eric Lander, Director of the Whitehead Institute/MIT Center for Genome Research, commented "There's no need for a supercomputer to assemble the human genome sequence." Is that true?

A. Theoretically, yes; there is no need for a supercomputer in the traditional sense. However, to reach their goals efficiently, research and pharmaceutical companies do use clusters of systems that effectively act as supercomputers. High-speed computers are necessary to analyze hundreds of terabytes of raw sequence data and correctly order over 3 billion pairs of bases.

Currently, three institutions affiliated with the public Human Genome Project are using a 100-CPU "biocluster" assembled by Compaq HPTC to assist with the initial steps in annotating the human genome, that is, identifying where the genes are located in the human chromosomes.

Q. Who are the institutions?

A. They are the University of California at Santa Cruz, the Whitehead Institute/MIT Center for Genome Research and Compaq's Cambridge Research Lab (CRL).

Q. What are they doing?

A. The University of California at Santa Cruz is responsible for laying out and arranging the DNA pieces into the best version of the sequence. The Whitehead Institute/MIT Center for Genome Research will use the "biocluster" to catalog the repetitive DNA in the genome. There are approximately 100,000 genes in the human genome. The rest is comprised of repetitive, although conceivably just as important, code whose purpose has yet to be determined. Finally, the Cambridge Research Lab will find the exact location of the genes.

Q. Is Compaq charging for the use of the "biocluster?"

A. No. Compaq is dedicated to assisting the public researchers in reaching their goal to complete the sequencing and annotation. The public project came to Compaq and requested assistance because they didn't have the computing capacity to complete their work on schedule. Unlike commercial genomics labs, the public sector has limited funds. Compaq agreed to help by providing them with the tools and storage to complete the annotation later this year.

Q. Where is the "biocluster" located?

A. It's at Compaq's Enterprise Systems Lab in Littleton, Mass.

Q. What are the components of the system?

A. Comprised of 25 ES40 SMP nodes each with 54 gigabytes of local storage and 4 Alpha CPUs (EV67, 667 MHz), the "biocluster" is networked together using 10/100 Ethernet. Twenty-four of the nodes have 4 gigabytes of RAM and one has 16 gigabytes of RAM. In addition, the system has a central file server with 1 terabyte of secondary storage. A standalone AlphaServer ES40 system is also available for testing scripts and any new user code before running on the main cluster.
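The configuration figures in this answer can be tallied in a few lines of Python. This is only a back-of-the-envelope sketch: the constant and function names are invented for illustration, and the numbers are taken directly from the answer above (the central file server and the standalone test system are not counted).

```python
# Tally of the "biocluster" figures quoted in the interview.
# All names here are illustrative; only the numbers come from the text.
NODES = 25                      # ES40 SMP nodes
CPUS_PER_NODE = 4               # Alpha EV67 CPUs per node
LOCAL_DISK_GB_PER_NODE = 54     # local storage per node
RAM_GB_PER_NODE = [4] * 24 + [16]  # 24 nodes with 4 GB, one with 16 GB


def cluster_totals():
    """Return aggregate CPU, RAM and local-disk figures for the cluster."""
    return {
        "cpus": NODES * CPUS_PER_NODE,
        "ram_gb": sum(RAM_GB_PER_NODE),
        "local_disk_gb": NODES * LOCAL_DISK_GB_PER_NODE,
    }


if __name__ == "__main__":
    for name, value in cluster_totals().items():
        print(f"{name}: {value}")
```

The CPU total matches the "100-CPU" description given earlier in the interview.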

Q. What type of software and databases are running on the system?

A. The software and databases being used in conjunction with the "biocluster" include public domain software from the University of Virginia (FASTA), Washington University (BLAST, HMMER), Southwest Parallel Software (Cross Match, SSearch), NCBI (BLAST), and Whitehead/MIT (Genscan), as well as RepeatMasker, ClustalW, readseq, and MySQL from other institutions.

The databases are GenBank, SWISS-PROT and GenPept. CPLEX and Platform Computing are the commercial codes being used.

Q. Why is the Cambridge Research Lab (CRL) involved in using the "biocluster" to complete the annotation?

A. The Cambridge Research Lab, part of Compaq, is developing algorithms and systems to support large-scale genomic analysis. They're currently working on an efficient, high-accuracy gene discovery software pipeline that will give them insight into how efficiently Compaq platforms support genomic analysis, as well as help them develop new computational ideas to improve the quality and speed of this analysis.

The computational tools that are under development include gene detection, functional genomics and comparative genome analysis.

Q. Did CRL just begin conducting genome analysis?

A. No. Dr. Simon Kasif, one of the chief researchers with CRL, was initially with The Institute for Genomic Research where he worked with Dr. Craig Venter, the CEO of Celera Genomics. Dr. Kasif helped design and build Glimmer, one of the most widely used systems for microbial genome analysis. He's currently building an active research program aimed at developing better systems for computational gene discovery. These discoveries will be shared with partners. CRL's objective is to follow standard academic methodology and release procedures in making the internally developed software available to the general public.

Q. What other projects will the "biocluster" be used for?

A. Initially, it will be used to conduct the annotation, with other genomic projects to follow.

Q. Are there any requirements for using the "biocluster?"

A. The use of the "biocluster" is determined by the chief researchers of the public Human Genome Project (HGP). Once an organization has been approved to use the "biocluster", they're required to test their scripts and any new code on a standalone AlphaServer ES40 system before running it on the main cluster.

Q. How long has Compaq been involved in bioinformatics?

A. Since 1990, Compaq has been at the forefront of developing and providing high performance computer architectures that meet the needs of pharmaceutical, biotechnology and medical companies. Ben Rosen, chairman of the board of Compaq, discussed Compaq's commitment to bioinformatics during his keynote address at Bio '99 in Seattle, explaining, "Today, it is increasingly difficult to separate the advances in biotechnology from advances in high performance computing. In fact, some leading scientists believe that high-end computing is the future of biology and medicine; it will take increasingly powerful computers and software to gather, store, analyze, model and distribute information."

Compaq enhanced their commitment to the biotechnology industry and bioinformatics through the acquisition of Digital, which introduced the first minicomputer, the Digital PDP-8, thirty years ago. The PDP-8 subsequently became the mainstay of research laboratories around the world.

Q. Dr. Francis Collins, Director of the National Human Genome Research Institute, commented that the Human Genome Project is more important than putting a man on the moon. How does bioinformatics compare?

A. The importance of each accomplishment can be debated. However, the computational requirements are far greater for the Human Genome Project. It is estimated that genetic databases double in size every six to eight months. Research organizations such as Celera Genomics and the Sanger Centre are already managing multiple terabytes of data, larger than the Library of Congress in size. This data will grow exponentially over the next few years.

Undoubtedly, the storage and computing resources required for the Human Genome Project are millions of times greater than used to land a man on the moon!
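To make the six-to-eight-month doubling estimate concrete, the short Python sketch below computes how much a database would grow over a given period if it doubled at a fixed interval. The function name and the 24-month horizon are chosen purely for illustration, and steady exponential growth is an assumption rather than a measured result.

```python
def growth_factor(months, doubling_months):
    """Size multiplier after `months`, assuming a constant doubling period."""
    return 2.0 ** (months / doubling_months)


if __name__ == "__main__":
    horizon = 24  # months; an arbitrary illustrative horizon
    for doubling in (6, 8):
        factor = growth_factor(horizon, doubling)
        print(f"doubling every {doubling} months: "
              f"{factor:.0f}x after {horizon} months")
```

Under these assumptions, a database grows between eightfold and sixteenfold in two years, so storage sized for today's data is outgrown quickly.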

Q. If the crux of the sequencing and annotation is completed, why the continuing need for high performance systems?

A. The sequence or code of DNA is comparable to a parts list for an airplane or sophisticated piece of machinery. The parts list doesn't tell how the airplane or machine works. The same is true for DNA. Researchers now have the lengthy process of determining how each gene works. The demand for computational power will significantly increase as they develop new diagnostics, drug therapies and new strategies and methods for identifying disease genes.

In addition, agricultural/chemical companies are investing in high performance systems to develop new strains of seeds or herbicide resistant crops.

Q. How is the human genome similar to crops?

A. A genome is a code. Just as there are similarities between animals and man, there are similarities in the DNA of plants and insects; in fact, in all living organisms. The first genome to be completed was that of phi-X in 1977. It had approximately 5,000 base pairs, compared to man's 3.2 billion.

Compaq systems are used in a wide variety of agricultural, chemical and pharmaceutical companies analyzing and simulating living organisms.

Q. How is Compaq involved in pharmaceuticals?

A. Compaq has provided solutions to SmithKline Beecham (SKB), one of the leading pharmaceutical companies, to develop new drugs based on genomics. A longtime Compaq Alpha customer, they have invested significantly in Compaq systems for their US bioinformatics division alone. They currently have installed a TruCluster solution containing more than 100 Alpha CPUs for DNA research.

Genentech, Inc., a biotechnology pioneer, relies on Compaq AlphaServer systems as a key element in their IT infrastructure, running everything from e-mail to high performance bioinformatics, protein and molecular biology applications. Both AstraZeneca Pharmaceuticals and Incyte Genomics chose AlphaServer and StorageWorks systems to handle the complexity and volume of genomic data and to increase customer service and access.

Q. What are some of the IT challenges bioinformatics customers face?

A. The key challenge is managing the vast amount of data in a 24x7 production environment. Celera Genomics has 700 interconnected Compaq Alpha processors. Each one can perform more than 250 billion sequence comparisons per hour. Every day, Celera processes 300,000 genomic fragments or 150 million base pairs. Celera currently has 70 terabytes of genomic data on line.
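The daily throughput figures quoted above imply an average fragment length, and a rough time for a single pass over a genome-sized amount of sequence. The Python sketch below works through that arithmetic; the names are illustrative, and the single-pass figure is only a lower bound, since shotgun sequencing reads each region several times over.

```python
# Figures quoted in the interview; names are illustrative only.
FRAGMENTS_PER_DAY = 300_000
BASE_PAIRS_PER_DAY = 150_000_000
GENOME_BASE_PAIRS = 3_200_000_000


def average_fragment_length():
    """Average base pairs per fragment implied by the daily figures."""
    return BASE_PAIRS_PER_DAY / FRAGMENTS_PER_DAY


def days_per_single_pass():
    """Days needed to process one genome's worth of base pairs once."""
    return GENOME_BASE_PAIRS / BASE_PAIRS_PER_DAY


if __name__ == "__main__":
    print(f"average fragment length: {average_fragment_length():.0f} bp")
    print(f"days per single pass: {days_per_single_pass():.1f}")
```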

In addition to data management, genomic researchers need computational performance. As Celera says, "Speed matters - Discovery can't wait." They chose Compaq because we had the processors, 64-bit operating system, and very large memory configurations that ran their applications 10 times faster than our nearest competitor.

Q. Have other organizations run benchmarks or comparison tests on Compaq systems?

A. Yes, in a genetic sequence analysis benchmark run by the University of Nebraska Medical Center, a Compaq AlphaServer ES40 outperformed the competition by a factor of two or more in all the benchmark tests.

When the memory needs of The Institute for Genomic Research (TIGR) began to surpass the capabilities of their 2 gigabyte 32-bit machines, they asked Compaq to port the TIGR Assembler code to a 64-bit AlphaServer 8400. They ran the code through a UNIX translator with no modifications. On TIGR's 32-bit Sun machine, the assembly code with a specified data set ran for 11 days without completing. On the AlphaServer, it completed the task in 17 hours.

Compaq's Cambridge Research Labs later optimized the code, reducing the computing time to just 10 hours.
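The run times in this answer translate into speedup factors with simple division, as in the Python sketch below. The names are illustrative. Because the 11-day run on the 32-bit machine never finished, the figures relative to that machine are lower bounds rather than measured speedups.

```python
HOURS_PER_DAY = 24

# Run times quoted in the interview; names are illustrative only.
SUN_32BIT_HOURS = 11 * HOURS_PER_DAY   # run abandoned before completion
ALPHA_PORTED_HOURS = 17                # unmodified port to the AlphaServer
ALPHA_OPTIMIZED_HOURS = 10             # after optimization by the research lab


def speedup(baseline_hours, improved_hours):
    """Ratio of baseline run time to improved run time."""
    return baseline_hours / improved_hours


if __name__ == "__main__":
    print(f"ported vs. 32-bit (at least): "
          f"{speedup(SUN_32BIT_HOURS, ALPHA_PORTED_HOURS):.1f}x")
    print(f"optimized vs. 32-bit (at least): "
          f"{speedup(SUN_32BIT_HOURS, ALPHA_OPTIMIZED_HOURS):.1f}x")
    print(f"optimized vs. ported: "
          f"{speedup(ALPHA_PORTED_HOURS, ALPHA_OPTIMIZED_HOURS):.1f}x")
```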

Q. Is there just one type of AlphaServer?

A. Compaq offers a complete choice of high performance technical computing solutions from the low end to the high end, Tru64 UNIX or Linux. The AlphaServer DS, ES, GS, and SC systems provide the performance of the latest Alpha microprocessors in scalable configurations from the low-cost, single processor AlphaServer DS10 to high-end, switched SMP systems such as the AlphaServer SC series supercomputer that supports up to 128 processors.

The new AlphaServer GS series supports true SMP programming across the full 32-CPU environment along with single cache-coherent shared memory of up to 256 gigabytes and breakthrough system management, serviceability and availability features.

Q. What does the AlphaServer SC series supercomputer offer?

A. Compaq's first true supercomputer, the AlphaServer SC series was developed to meet the challenging ASCI requirements, which called for multi-teraflops performance, superior cost/performance and high availability with proven high-volume components. With the industry's most advanced, high-speed interconnect technology, the AlphaServer SC supercomputer systems are able to connect 64 to 128 AlphaServer symmetric multiprocessor nodes, managed as a single system.

Q. Where are AlphaServer SC supercomputers installed?

A. Last year at Supercomputing '99, Compaq announced the availability of the AlphaServer SC; the first one shipped in September 1999 to Lawrence Livermore National Laboratory. Currently, the AlphaServer SC is #15 on the Top500 List with the following performance:

  • Processors: 512 x Alpha EV67 (667MHz)
  • Rmax: 607.6 GFLOPs
  • Rpeak: 683 GFLOPs
  • Nmax: 200,000
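
The relationship between these figures can be checked with a short Python sketch. It assumes each processor completes two floating-point operations per clock cycle; that value is not stated in the interview, but it reproduces the listed Rpeak. All names are illustrative.

```python
# Figures from the list above; names are illustrative only.
PROCESSORS = 512
CLOCK_GHZ = 0.667
RMAX_GFLOPS = 607.6

# Assumption (not stated in the interview): two floating-point
# operations per clock cycle per processor.
FLOPS_PER_CYCLE = 2


def rpeak_gflops():
    """Theoretical peak: processors x clock rate x operations per cycle."""
    return PROCESSORS * CLOCK_GHZ * FLOPS_PER_CYCLE


def efficiency():
    """Fraction of theoretical peak achieved on the benchmark (Rmax/Rpeak)."""
    return RMAX_GFLOPS / rpeak_gflops()


if __name__ == "__main__":
    print(f"Rpeak: {rpeak_gflops():.1f} GFLOPs")
    print(f"Rmax/Rpeak: {efficiency():.1%}")
```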

Q. How many AlphaServers, outside of the SC series, have been sold?

A. The International Data Corporation (IDC) has determined that Compaq AlphaServer systems have moved into second place in market share in High Performance Technical Computing. Our total HPTC business including systems, storage, and services is well in excess of $2 billion per year.

Q. Storage seems to be a concern for biotechnology researchers. What types of storage does Compaq offer?

A. Compaq StorageWorks are valued for their expandability and redundant components (dual heads, controllers and power supplies) that ensure no single point of failure. Naturally, StorageWorks are scalable and flexible. In addition to Celera's 70 terabyte system, the Sanger Centre uses a StorageWorks RAID system that makes up the bulk (3.5 terabytes) of their disk storage, along with a 300 GB Network Appliances RAID subsystem.

The Institute for Genomic Research (TIGR) uses a StorageWorks Enterprise Storage Array (ESA) 1000 to handle their ever-increasing data storage requirements, including the backup of all sequence chromatogram files and text data.

Q. What about software?

A. Compaq has collaborations and joint development efforts with the leading academic and commercial software developers for bioinformatics. Working closely with original code developers, Compaq engineers optimize their code for Alpha systems. Currently, there are over sixty public domain codes available on the Alpha platform and many more are optimized for Alpha.

Oracle provides the leading database for bioinformatics, and is optimized for performance on Alpha and Compaq's Tru64 UNIX.

QUOTES:

Celera Genomics, Rockville, MD - "Why are we using Compaq's Tru64 UNIX on Alpha? Performance. Large memory configurations. The only platform that can deliver the performance. In our benchmark tests, the next vendor was literally an order of magnitude slower with our application." Marshall Peterson, director of infrastructure technology.

Sanger Centre, Cambridge, England - "Unlike the smaller RAID system, which is bounded and has single components, the Compaq StorageWorks systems are expandable and have dual heads, controllers, and power supplies, so there is no single point of failure. The 64-bit technology driving Compaq Alpha and Compaq Tru64 UNIX systems is speeding sequencing of the human genome. Without access to a great deal of the very latest computing resources, we would not even be able to contemplate reaching our targets." Phil Butcher, head of information technology.

The Institute for Genomic Research (TIGR), Rockville, MD. - "The Alpha systems have improved our productivity dramatically. Our researchers can get their work done a lot faster. They have the flexibility to repeat an experiment or analysis on a given day that they used to have to wait a week to do." Bruce Vincent, manager of information technology.


For more information, please visit http://www.compaq.com/hpc.

