Petascale Virtual-Data Grids for Data Intensive Science

 

 
 

The GriPhyN Collaboration
http://www.phys.ufl.edu/~avery/griphyn/white_paper.html

 

 
 

Introduction

A new generation of experiments is under construction that, when operational in the next few years, will usher in the most comprehensive program of study ever attempted of the four fundamental forces of nature and the structure of the universe. The experiments include particle physics detectors that will probe the smallest length scales, interferometers that will detect the gravitational waves of binary pulsars, supernovae and other exotic objects, and automated sky surveys at arc-second or better resolution (> 10^12 pixels) that will enable vastly improved systematic studies of stars, galaxies, nebulae and large-scale structure.

The most ambitious and long-lived of these new experiments are the CMS and ATLAS experiments at the LHC (Large Hadron Collider), LIGO (Laser Interferometer Gravitational-Wave Observatory) and SDSS (Sloan Digital Sky Survey). For twenty years or more, the LHC will probe the TeV frontier of particle energies to search for new phenomena and improve our understanding of the nature of mass. LIGO will detect and analyze, over a similar span of time, nature's most energetic events, which send gravitational waves across the cosmos. SDSS will survey a large fraction of the sky to provide the most comprehensive catalog of astronomical data ever recorded. The investigations of SDSS, LIGO and LHC, in which the National Science Foundation has made heavy investments, will involve thousands of scientists in all regions of the world.

Mining the scientific wealth of these experiments, over national and intercontinental distances for a period of decades, presents new problems in data access, processing and distribution, and in collaboration across networks, on a scale never before encountered in the history of science. In the tradition of high energy physics collaborations, the LHC, LIGO and SDSS collaborations, working in concert with computer scientists, will study, develop and implement solutions for the most pressing problems, in time for the startup of LIGO data-taking in 2002 and LHC data-taking in 2005. The goal of this work is to make US scientists, especially the youngest scientists with limited opportunities to travel to the experiments' principal sites, a highly productive resource, deeply engaged in the life of the experiments and the ongoing worldwide process of search and discovery. In order to achieve these technical and human goals, a number of unprecedented challenges in information technology must be met:

  • Rapid and transparent access to data subsets drawn from massive datasets, rising from the 100 Terabyte to the 100 Petabyte scale over the next decade;

  • Transparent access to distributed CPU resources, from the Teraflops (2000) to the Petaflops (2010) scale;

  • Very small signals which must be extracted from enormous backgrounds;

  • An intellectual community, numbering in the thousands and distributed globally, that needs access to this data over networks with bandwidths varying by orders of magnitude.

The GriPhyN (Grid Physics Network) collaboration, made up of computer scientists and physicists from these experiments, proposes to implement the world's first production-scale computational grid to jointly meet these unprecedented challenges in computing, data-handling and networking. (A computational grid is a set of computing resources distributed over a geographically wide area, in this case the United States plus CERN.) A properly implemented computational grid benefits LHC, LIGO and SDSS by greatly increasing the efficiency of accessing and processing their data, which in turn enhances the quality of the science. The grid idea is explored further in the next section.

Simply scaling up present hardware and software configurations would be hopelessly inadequate. We argue instead that our collaboration, whose members have extensive experience in object-oriented (OO) software technology, distributed computing and large-scale data management, is uniquely qualified to apply and extend the research and prototype development in these areas to implement a production Grid system. This implementation will include, as a minimum, OO design methodologies, distributed computing, high-speed networking, data caching and mirroring, tools for management, configuration and monitoring, OO application software, and human support. A single R&D effort arranged around a common implementation provides the most cost-effective way of benefiting all three experiments.

There are differences among the experiments that have a bearing on implementation strategy. First, the initial science runs of SDSS, LIGO and LHC begin in 2000, 2002 and 2005, respectively, motivating the early deployment of production services, albeit in limited form. Second, since LIGO's ability to search for some weak sources is limited by computing power, it can absorb PetaFlops of resources (1 PetaFlop = 10^6 GigaFlops), far more than is planned for either LHC experiment. On the other hand, LHC will generate two orders of magnitude more data than LIGO (which itself will generate more data than SDSS), leading to correspondingly more difficult problems in storing, moving and accessing this data. The user community is also far larger for LHC (thousands) than for SDSS or LIGO (hundreds), and more geographically dispersed.

Solution based on a Data Grid

We propose to integrate the computing resources obtained through this proposal with the facilities at CERN, the national laboratories and university workgroups into the first production-level hierarchical computational grid, transcontinental in size, taking advantage of the leadership and large accumulated R&D and prototype experience of collaboration members. We define five levels of Grid resources:

  • Tier 0: CERN (the Geneva site of the ATLAS and CMS experiments)
  • Tier 1: A US regional center for ATLAS, CMS, LIGO or SDSS
  • Tier 2: A regional center within the US, deployed at a university
  • Tier 3: The computing resources of a university research group
  • Tier 4: An individual workstation

We call the hierarchy above a Data Grid, reflecting our design strategy that each Tier is defined roughly by the scale of its storage and I/O throughput capabilities. Although all Tiers are part of the Data Grid, this proposal seeks funding only for Tier 2 hardware, the R&D effort, the development of Grid-based applications and all Grid networking. We expect that DOE will provide funding for the large-scale computing needs of the Tier 1 centers for ATLAS and CMS, since they will be located at national laboratories, and that individual university groups (Tier 3/4) will continue acquiring facilities through their base programs so that they maintain control of their own resources. Our effort is complementary to the functions of these other Tiers because it extends their existing capabilities and makes it possible for the first time to coherently mobilize large-scale resources for solving specific scientific problems.
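
As an illustration only, the short Python sketch below shows one way the Tier structure might be represented in software, with each site characterized by its tier number, nominal storage and I/O throughput. The site names, capacities and the simple parent/child lookup are assumptions chosen for the example, not figures from the experiments' computing plans.

    # Minimal sketch of the tiered Data Grid hierarchy (illustrative values only).
    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class GridSite:
        name: str
        tier: int                      # 0 = CERN, 1 = regional center, ..., 4 = workstation
        storage_tb: float              # nominal storage capacity (terabytes), assumed
        io_gbps: float                 # nominal aggregate I/O throughput (Gb/s), assumed
        parent: Optional["GridSite"] = None
        children: List["GridSite"] = field(default_factory=list)

        def attach(self, child: "GridSite") -> None:
            """Register a lower-tier site beneath this one."""
            child.parent = self
            self.children.append(child)

    # A toy hierarchy: Tier 0 feeds a Tier 1 center, which feeds a Tier 2 site.
    cern = GridSite("CERN", tier=0, storage_tb=100_000, io_gbps=10.0)
    tier1 = GridSite("US regional center", tier=1, storage_tb=30_000, io_gbps=2.5)
    tier2 = GridSite("University Tier 2", tier=2, storage_tb=3_000, io_gbps=0.622)
    cern.attach(tier1)
    tier1.attach(tier2)

    def nearest_ancestor_with(site: GridSite, needed_tb: float) -> Optional[GridSite]:
        """Walk upward to the first higher Tier large enough to hold a dataset."""
        node = site.parent
        while node is not None:
            if node.storage_tb >= needed_tb:
                return node
            node = node.parent
        return None

    print(nearest_ancestor_with(tier2, needed_tb=10_000).name)  # -> US regional center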

Tier 2 represents a novel computing resource. A multi-gigabit backbone will connect the Tier 2 sites to each other, to the Tier 1 centers and to CERN. Universities will connect to the backbone through existing Internet2-style connections, completing the hierarchy. The high-speed backbone enables the rapid data movement necessary to balance the computing and I/O loads across the Grid. We expect to have approximately 19-20 Tier 2 centers: six each for ATLAS and CMS, five for LIGO and two or three for SDSS. The criteria by which Tier 2 sites are selected are not yet fixed, but as a general rule they should be geographically dispersed, able to connect reasonably easily to high-speed networking, and located in areas where skilled R&D personnel are available. Most Tier 2 sites, particularly those in metropolitan areas, will foster fruitful collaboration between R&D developers and application programmers.
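
To give a sense of scale for the backbone requirement, the rough estimate below (a Python sketch with illustrative link speeds and an assumed 50% usable fraction of raw bandwidth) computes how long it would take to move a one-terabyte data subset between sites.

    # Back-of-the-envelope transfer times (illustrative assumptions only).
    def transfer_hours(dataset_tb: float, link_gbps: float, efficiency: float = 0.5) -> float:
        """Hours to move dataset_tb terabytes over a link_gbps link,
        assuming only a fraction `efficiency` of the raw bandwidth is usable."""
        bits = dataset_tb * 8e12                      # terabytes -> bits
        return bits / (link_gbps * 1e9 * efficiency) / 3600.0

    # Moving a 1 TB data subset between Tier 2 sites:
    print(f"{transfer_hours(1.0, link_gbps=0.622):.1f} h over an OC-12 (622 Mb/s) link")  # ~7 h
    print(f"{transfer_hours(1.0, link_gbps=2.5):.1f} h over an OC-48 (2.5 Gb/s) link")    # ~1.8 h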

A Data Grid turns the notion of a worldwide collaboration clustered around a single research center on its head. It takes advantage of the greatest strength of the US university-based system: combining research and education into a single mission. This system has produced more Nobel prize winners in basic science than any other. Instead of a flow of people and resources to the laboratory, the Data Grid sends data outward from the laboratories to the nation's vibrant university communities, where direct access to the data by faculty, research staff and students will vastly increase its impact. Demonstrations taking advantage of direct data access would extend this educational impact to the public, providing a rare opportunity to see science in the making and inspiring the next generation of scientists.

To summarize, the Data Grid we are proposing binds the computing systems that will be in place when SDSS, LIGO and LHC begin taking data into a coherent resource that can be accessed from a desktop located anywhere in the world. A Data Grid combines computing, data storage and networking in a highly transparent infrastructure that can be mined efficiently by the university research community to extract scientific results. This vision can be realized for a relatively small investment in hardware (the Tier 2 centers), in fast networking linking the elements together, and in a strong R&D effort that combines the results of existing research projects to integrate the elements into a production Grid. An investment leading to an operational Data Grid could pay itself back many times over, as it would spur private research into the development of future Grid systems that would unite far-flung company resources and play a role in the general economy.

GriPhyN personnel and the R&D effort

The GriPhyN collaboration is uniquely qualified to take on the R&D effort required to mold disparate computing resources into a functioning Grid. Our membership includes leading computer scientists at DOE laboratories and universities, the scientists specializing in computing and data handling at the four DOE particle-physics accelerators, university physicists who have led previous distributed computing initiatives, and leaders within the computing efforts of all three experiments. The presence of such high-level members within GriPhyN shows the broad support for this initiative within our separate experiments. It also provides us with political support and access to significant computing infrastructures, making it possible to link existing laboratory and university computing resources with the Tier 2 sites funded by this proposal to create Data Grid testbeds and, eventually, the first production implementation.

The ambitious goals of this project are achievable because we can take advantage of a strong base of software and expertise in such areas as high-performance networking, advanced networked middleware (e.g., Globus and DPSS), scientific data management, distributed fault-tolerant computing, tools for remote collaboration and high-performance storage systems. Members of our team and of other projects have carried out significant research in these areas and have deployed testbed Grid systems using a combination of existing and new facilities. We are confident that we will save significant time in our R&D effort, especially in the early stages, by building on the successes of these efforts, using a combination of collaboration and management linkages to extract the useful products from each project while avoiding interface mismatches. Furthermore, this collaboration will expose the software tools from these projects to a wider class of problems and issues, which will in turn increase their robustness.

Simulations are widely regarded as essential tools for the design, performance evaluation and optimization of complex distributed systems. We are aided in our R&D effort by the powerful MONARC simulation team, a joint ATLAS/CMS project whose goals are to (1) simulate LHC computing, (2) develop baseline models of computing, including the strategies, priorities and policies for effective data analysis by a worldwide collaboration, (3) verify and bracket resource requirement baselines for computing, data management and networking, and (4) maximize the performance of a given set of resources. MONARC uses its own Java 2-based advanced simulation tools, building on concepts from the rapidly expanding computing simulation market. For the advanced architecture we are proposing, simulation of Grid hardware and software components and their interactions is critical for understanding the overall system behavior, particularly when new elements are incorporated into the system. The importance of this understanding cannot be overstated for a complex, distributed Grid in which thousands of jobs will be scheduled.
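
The toy Python sketch below is not MONARC, but it illustrates the kind of question such a simulation answers: how a scheduling policy spreads a stream of analysis jobs across sites of different capacity, and what turnaround results. The site names, capacities and the greedy least-loaded policy are assumptions chosen purely for the example.

    # Toy scheduling simulation (illustrative; not MONARC).
    import heapq

    # (site name, CPU capacity in arbitrary "work units per second"), assumed values
    SITES = [("Tier1", 10.0), ("Tier2-A", 4.0), ("Tier2-B", 4.0)]

    def simulate(job_costs, sites=SITES):
        """Greedy least-loaded scheduling; returns each site's final completion time."""
        # heap of (time at which the site next becomes free, site name, capacity)
        free_at = [(0.0, name, cap) for name, cap in sites]
        heapq.heapify(free_at)
        finish = {name: 0.0 for name, _ in sites}
        for cost in job_costs:
            t, name, cap = heapq.heappop(free_at)   # pick the least-loaded site
            t_done = t + cost / cap                 # run time shrinks with capacity
            finish[name] = t_done
            heapq.heappush(free_at, (t_done, name, cap))
        return finish

    # 1000 identical analysis jobs of 100 work units each:
    print(simulate([100.0] * 1000))

Swapping in a different policy or set of capacities and rerunning is exactly the kind of what-if exercise a full simulation performs at realistic scale.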

We expect to leverage the tools and experience of these projects to bring our project rapidly up to speed and implement early versions of Tier 2 centers to act as testbeds. Several GriPhyN participants are members of the following efforts:

 

  • Particle Physics Data Grid (PPDG): Data-intensive Grid testbed for HEP applications
  • China Clipper: Data-intensive HEP analysis over the wide area
  • Globus: Grid research, software, tools and middleware
  • MONARC: Simulations of Grid components and interactions
  • Nile, Condor: Management of pools of processors
  • GIOD: Large-scale movement of object data between sites

Funding scale

We propose that NSF fund (1) R&D on Grid software tools, (2) development of data analysis applications that take full advantage of the Grid environment, (3) personnel and hardware needed at Tier 2 centers, (4) high-speed networking connecting the Tier 2 centers to each other and to the Tier 1 centers, (5) an upgrade of the link to CERN to permit Grid traffic, and (6) the associated deployment of state-of-the-art interactive analysis and remote collaboration tools.

Estimating the cost of a Tier 2 center is difficult because of uncertainties in cost evolution and because part of the Tier 2 scope will be determined by the R&D program. However, we can make some projections based on a conservative use of Moore's Law and on the CPU and disk resources we want relative to the Tier 1 sites and to the available networking bandwidth. Furthermore, we are imposing the condition that these sites, in contrast to Tier 1 centers, be built with (mostly) commodity hardware at existing university facilities to minimize support personnel and startup costs.
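
For example, under an assumed price/performance doubling time of 18 months (an illustrative figure, not one fixed by this proposal), hardware purchased in the project's final years delivers several times the capacity per dollar of hardware purchased at the start, as the short Python sketch below shows.

    # Moore's-Law-style projection (assumed 18-month doubling; illustrative only).
    def capacity_per_dollar(year: float, doubling_months: float = 18.0) -> float:
        """Relative price/performance in `year` (year 0 = project start)."""
        return 2.0 ** (12.0 * year / doubling_months)

    for year in range(6):
        print(f"year {year}: x{capacity_per_dollar(year):.1f} capacity per dollar")
    # Year-4 purchases buy roughly six times the capacity per dollar of year-0 purchases.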

With these assumptions we estimate the cost of each Tier 2 center to be approximately $2.0M - $2.2M for hardware (CPU, disk, commodity network switch, workstations, possible tertiary storage) plus $200K/year for system support. Network costs are more difficult to estimate. The funding profile includes a gradual phasing in of the Tier 2 centers during the five-year lifetime of the proposal, taking into account that early Tier 2 sites will need to have their hardware refurbished towards the end of the project to exploit the expected steep price/performance improvements. We have allocated $12M for US networking and $5M for the link to CERN over five years. The approximate 5-year funding scale is shown below, including a complete R&D effort:

 

  • System support and R&D personnel: $27M
  • US networking: $12M
  • Network link to CERN (5 years): $5M
  • Hardware for Tier 2 sites: $46M
  • Total: $90M

Applicability of the Data Grid to other scientific domains

LHC, LIGO and SDSS belong to a new set of scientific projects planning to generate large data volumes that must be accessed and studied by a distributed user community. A Data Grid like the one proposed here, if implemented, would be of broad benefit to these groups as well. Without further comment, we list these projects below:

 

  • The Earth Observing System Data Information System (3 PB by 2001);
  • The Earth System Grid;
  • The Human Brain Project (time series of 3-D images);
  • The Human Genome Project;
  • Automated astronomical scans (similar to SDSS);
  • Astrophysical data;
  • Geophysical data;
  • Satellite weather image analysis;
  • Molecular structure crystallography data.

The Data Grid and the future of computing

There are important advantages that derive from deploying computing resources in the form of a Data Grid. A highly decentralized Grid will enable a user situated anywhere in the world to efficiently access data and mobilize large-scale resources for scientific analysis. But the most profound result, discussed earlier, is the flow of data outward from the laboratories to their intellectual communities, strengthening the university infrastructure and broadening access to researchers, students and the public. Our Grid is a model of the worldwide distributed systems that could undergird future scientific and research collaborations. Imagine sites interlinked by a highly transparent computing, data-storage and networking infrastructure focused on high-speed data access, processing, and rapid transport and delivery of results.

As important as these benefits might be to science, they are dwarfed by the role future Data Grids could play in the general economy. In the future, corporations would enjoy the same benefits that follow from transparent computing and data access linking distant sites. This benefit would extend to every part of the economy as the computing domain expands to hand-held devices and as-yet unforeseen elements, and as mobile agents connect these elements to global networks. Given the sustained exponential increase of computing power and network throughput over time, this vision of computing and advanced networking could be realized by the end of the first decade of the 21st century. An investment leading to an operational Data Grid would pay itself back many times over, as it would spur private research into the development of these future Grids, in the same way that NSF funding of national networks led to enormous private investment in network technologies and the Internet.

 
