TeraGrid Software Strategy: E Pluribus Unum
By Patricia Kovatch, San Diego Supercomputer Center
TeraGrid: Background and Development
"Build and deploy the world's fastest distributed infrastructure
for open scientific research:" this is the mission of the TeraGrid project. It is
a ~$100M multi-year project, sponsored by the National Science
Foundation (NSF). It's the first joint multi-site supercomputing effort featuring
over 24 TeraFlops (TF) of initial combined compute power (13 TF of
Itanium2), one PetaByte of online storage and a 40 Gigabit per second
backbone. Additional sites are in the process of adding their resources to the
TeraGrid. A grid infrastructure was developed and deployed to help scientists
and researchers make use of these resources. Cosmology, weather
and geophysics applications make up the initial allocations on the
TeraGrid. Coordinating the efforts to manage the sheer amount of software,
operations and user requests has been an interesting challenge.
The NSF vision, as given in the Request For Proposal (RFP) for
the TeraGrid project, reads, "NSF seeks to open a pathway to future
computing, communications and information environments by creating a very
large-scale system that is part of the rapidly expanding computational grid."
The goals of the TeraGrid are to "enable new science by offering
new capabilities, build an extensible grid that can be grown and copied,
and provide an evolutionary pathway for current users." The
solicitation specified that the proposals must include distributed, multi-site
facilities with single site and "grid enabled"
high end compute capabilities connected via ultra high-speed networks. A distributed storage system with
both online and archival storage capabilities was required. Remote visualization
and a production-level quality of service were necessary components.
Many Major Research Equipment (MRE) projects have common
needs: geographically distributed instruments, Terabytes to PetaBytes of data,
data unification and sharing between multiple formats and remote sites,
high end computation (simulations, analysis and data mining), presentation
of results (visualization, virtual reality, etc.). Examples of these
projects include:
- Atacama Large Millimeter Array (ALMA)
- EarthScope (Structure and Evolution of the North American Continent)
- IceCube (High Energy Neutrino Detector)
- Laser Interferometer Gravitational Wave Observatory (LIGO)
- Network for Earthquake Engineering Simulation (NEES)
Considering the needs of these projects, we sought to create a
seamless environment for the geographically distributed scientists of the
TeraGrid. We decided to minimize differences in hardware and software. We
researched and tested ways to make resources available equally from each site.
We defined a unified user support operation and we developed
an automatic reporting and testing infrastructure to guarantee that
a certain service level of resources would be available to our scientists
and researchers.
January 1, 2004 marked the first day of production for the
four founding sites of the TeraGrid. To make this happen, these four
diverse sites had to define and implement common goals. Thirteen
working groups were ordained to flesh out the vision and make detailed
implementation plans. The working groups consisted of Clusters, Grid,
Networks, Performance Evaluation, User Services, Data, Operations,
Applications, Visualization, External Relations, Interoperability, Account
Management, and Security.
TeraGrid Founding Sites
San Diego Supercomputer Center (SDSC)
National Computational Science Alliance (NCSA)
Argonne National Laboratory (ANL)
California Institute of Technology (Caltech)
New Sites
Pittsburgh Supercomputer Center (PSC)
Indiana University
Purdue University
Texas Advanced Computing Center (TACC)
Oak Ridge National Laboratories (ORNL)
Corporate Partners
IBM, Intel, Qwest, Sun, Myricom, Oracle
|
TeraGrid: Software Management Strategies
The basic TeraGrid software stack consists of:
- A base operating system
- Applicable drivers (for example, the SDSC Itanium2 cluster has Myrinet, Qlogic (fibre channel), and SysKonnect Gigabit Ethernet drivers)
- Compilers (GNU and Intel for TeraGrid Itanium2 clusters)
- Message passing interfaces (MPICH, MPICH-GM (for Myrinet on Itanium2)
- Numerical libraries (FFTw, ScaLAPACK, etc.)
- Data formatting tools (HDF4 and HDF5)
- Archival storage and data collection access
- Database clients
- Parallel filesystem daemons
- Resource manager (Portable Batch System (PBS) on Itanium2)
- Globus Grid infrastructure
Many other applications, such as the Storage Resource Broker
(SRB), help the scientists manage their data sets for their research. In total, there
are over 100 separate pieces of software. Keeping all of these pieces of
software up-to-date and interoperable at many different sites can be unwieldy,
but there are several techniques TeraGrid employed to manage all of
these pieces.
Use of a Global Software
Repository Each site has representatives responsible for parts of
the software stack who are tasked with watching for updates to the
software (including security), testing these updates and checking in the
software, accompanying configuration files and build instructions to the global
software repository. Other sites then check out the programs and files from the
global software repository and install them locally. TeraGrid currently uses an open source repository tool
called Concurrent Versions Systems (CVS) to manage the checking in and
checking out of software versions of all of our software, from the kernel
to applications. Version changes are typically applied during one of
the regularly coordinated, scheduled (weekly) Preventive Maintenance
(PM) periods which are staggered between the sites so not all resources
are unavailable at the same time. This scheduling strategy allows
software (and hardware) upgrades to happen in a timely manner. Updates to the
software stack due to security concerns such as root exploits happen
immediately. Policies and procedures are in place
to prevent, protect, detect and handle security incidents.
Maintenance of Nodes The nodes are maintained using a
standard suite of tools for automatic and repeatable installation and
updating software, automatically parsing the logs, etc. Being able to update
and check the software on all the nodes (as well as install the nodes) is critical
to maintaining systems with multiple nodes.
Consistent User Environment To abstract away the local
dependencies at each site, agreed upon environmental variables are used to make users' scripts portable.
This enables users to experience a consistent environment regardless of which site
is accessed, an attribute called TeraGrid "roaming". There are more than
20 variables defined for scratch and parallel filesystems, library
paths, MPICH, etc. The software that manages setting the user
environment variables is called Softenv.
More information about Softenv is available at
http://www-unix.mcs.anl.gov/systems/software/msys/
.
Sophisticated Monitoring Tools One "at-a-glance" view of
each site's software versions and basic software functionalities helps
everyone view the status of the system immediately. This
"Inca Test Harness and Reporting
Framework" is a flexible software infrastructure
for automated testing, verification and monitoring of the TeraGrid
software stack and environment. Inca is composed of a set of "reporters"
that interact with the system and report status information, a "harness"
that provides basic control of the reporters and collection of information,
and "clients" that provide a web
interface to the information collected by the reporters. Currently over 900 pieces
of information about the software and environment are checked.
A reporter is a self-contained pluggable component implemented as
a script or executable that performs a test, benchmark or query and outputs
a result in XML. For example, a reporter can output package version
information or test whether a grid service is up
and available. Reporters can be written in any language and Perl and Python
API helper libraries are available to ease the process of writing reporters.
Currently, there are more than 100 reporters written with these API's. The
execution details of the reporters (frequency, inputs, etc.) are handled by the
harness. The harness contains a set of daemons which manage the
distributed execution of the reporters, collect
the data to a central location and publish the data into an information
service such as MDS2. Clients compare the resource information collected by
Inca against a representation of the
software stack and environment and display it on a TeraGrid web status page.
A version of the Inca Test Harness and Reporting
Framework is available for download at http://tech.teragrid.org/inca/
.
Software Change Request Process Anyone can submit a
request to add or update software to the TeraGrid software stack by
submitting a software change request. An organized plan for testing,
deployment and integration of the software is developed and implemented.
Test nodes at each site are connected and help with the interoperability testing.
Use of the Globus Grid Toolkit The toolkit helps create
a geographically distributed infrastructure between the sites.
Gx-map and CACL make it easier to manage
this infrastructure. CACL is an Open SSL based Certificate Authority
(CA) CLient system that issues digital certificates. It
automatically authenticates the identity of the
user using the username and password. The encrypted request is submitted to
the CA daemon, which decrypts it and authenticates the user in the same
way that ordinary login authentication is done. It then either issues a
certificate or a rejection notice to the CA CLient program. If the certificate is issued, it
is automatically placed in the user's home directory. There is a program that
the CA administrator can use to revoke certificates and create server or
service certificates. These commands can only be run by a system administrator on
the machine that provides the CA services. More information is available at http://www.sdsc.edu/CA.
Gx-map allows users to add their certificate automatically to the
grid-mapfile. This file maps a specified Distinguished Name (DN) to a
Unix account name. Gx-map propagates these changes automatically
between systems. An auxiliary tool called
gx-check-cacl-index automatically checks for new user certificates issued
or revoked by CACL and invokes the
gx-map command to request the appropriate updates. With
these systems, a user can obtain a digital certificate and update the
grid-mapfile entries on a number of systems
without manual intervention from systems administrators. This software can
be found at http://users.sdsc.edu/~kst/gx-map/
.
TeraGrid: Account Management and Accounting
To make it easy for scientists and researchers to get started on
the TeraGrid, there is one location that gives information about
getting accounts and allocations: http://teragrid.org/docs/guide_access.html
. Account requests funnel to the national,
peer-reviewed allocation committees. The National Resource Allocation
Committee (NRAC), Partnership Resource Allocation Committee (PRAC)
and
Alliance Allocations Board (AAB) meet several times a year and
allocate over ten million Service Units per year. A Service Unit (SU) is
approximately one Central Processing Unit (CPU) hour. Most allocations last for one
year, though multi-year allocations are now available to scientists with
multi-year projects. Start-up allocations can be granted by the
Development Allocations Committee (DAC) for up to 30,000 SUs. Once allocations
are approved, the project and account notifications are transported
between the remote sites and the TeraGrid Central Database (Postgresql)
located at SDSC via the Account Management Information Exchange
(AMIE) system. AMIE uses an XML format to
facilitate import and export between each site's local database and central
database. Soon Perl scripts will automatically transport and accept account
addition and deletion requests and allocation usage statistics from SDSC to the
other sites.
When a user submits a job to a TeraGrid machine, a wrapper
script around the job submission commands
(PBS or Globus) checks to see
whether the account has an allocation. The job runs if the account has
enough allocation to fulfill the job request. After the job completes, the
job information is logged. Then the TGAccounting
package, developed at SDSC, converts the resource
manager (PBS) and Globus logs into an
XML format compatible with the Global Grid Forum (GGF) Usage Record
Format and National Middleware Initiative accounting protocols. These scripts
run nightly. The XML is then parsed and imported into records in the local
site's database and finally transported via
AMIE to the TeraGrid Central Database where the allocation is deducted.
Soon the allocations will be fungible between sites.
TeraGrid: Operations and Unified Help Desk
Though the TeraGrid resources span several sites, the TeraGrid
Help Desk appears as one entity to the scientists and researchers who use
it. There is one phone number, one e-mail address (help@teragrid.org) and
one website (http://www.teragrid.org)
presented to the users. To support this, there is one ticket system that
accepts the trouble tickets via e-mail or web input. This ticket system has
been specially tailored by NCSA to quantify and track tickets for the TeraGrid.
It automatically notifies the representatives from each site when new
tickets arrive. The tickets can be assigned to a specific site or all sites depending
on the request in the ticket. Site representatives update tickets via
a webpage interface to the ticket system database.
Both NCSA and SDSC have round-the-clock operations centers
and trade off the monitoring to provide coverage on a 24x7 basis.
These centers have procedures for contacting people during off-hours for support
as needed.
Each TeraGrid site has an operations center with
monitors dedicated to displaying the status of the machines.
Clumon, software developed at NCSA, monitors
performance metrics from the nodes and queues and correlates data and other queue and
job information from the resource manager and scheduler. It provides a
near-real time, graphical display about the health of the system and the
processes running on it. Visit http://clumon.ncsa.uiuc.edu for more information. Another monitoring tool called
Ganglia is also used.
TeraGrid: Collaboration
Each site brings different expertise to the TeraGrid. Each site has
provided critical insight, ideas and implementation effort to the TeraGrid project.
Each site has contributed essential software development, testing and
troubleshooting. The sites have worked closely together to develop the design
and implementation of the TeraGrid facilitated by weekly conference
calls and periodic face-to-face meetings.
TeraGrid: Futures
Research and demonstration of new technologies that improve
upon the current implementation are continually being conducted.
Specific areas include automatic scheduling of resources at multiple
sites (metascheduling), sharing home directories over the wide area
(parallel filesystems mounted over WAN) and remote manipulation of
storage facilities (Fibre Channel over IP protocol). In all of these cases, prototype programs have
been demonstrated at the Supercomputing conferences. For more information
on these demonstrations, please visit
http://www.sdsc.edu/~pkovatch/new-tech .
As new sites join on, new resources are available and
new interoperability and scalability issues are identified. Growth will continue
to provide new challenges and rewards to the scientists and researchers
making use of the ever-expanding TeraGrid.
About the Author
Patricia Kovatch is the Manager of the High Performance
Computing Production Archival Storage Systems at the San Diego Supercomputer
Center (SDSC). She is SDSC's lead for the Cluster and Grid Working Group.
She also serves on the Data and Performance Evaluation Working
Groups. She has worked on 512 processor commodity clusters with Myrinet
and currently Itanium2 clusters for the TeraGrid project.
Author Contact Information
Patricia Kovatch
Manager, HPC Systems and Archival Storage
San Diego Supercomputer Center
PH: (858)822-5441
E-mail: pkovatch@sdsc.edu |