|
ABSTRACT
In recent years we have seen an enormous growth in the size and prevalence of data processing workloads [Fayyad 1998, Gray 1997]. The picture that is becoming increasingly common is depicted in Figure 1. In it, organizations or resourceful individuals provide services via a set of loosely-coupled workstation nodes. The service is usually some form of data-mining like searching, filtering, or image recognition. Clients, which could be machines running web browsers, not only initiate requests, but also partake in the processing, with the goal of reducing the request turnaround. That is, when the servers are overloaded, clients with spare cycles take some of the computational burden. Naturally, many aspects of such a system cannot be determined at design time. E.g., exactly how much work a client should do depends on the computational resources available at the client and server cluster, the network bandwidth unused between them, and the workload demand. This position paper is interested in this and other aspects that must be divined at run-time to provide high performance and availability in data-parallel systems.What makes system tuning especially hard is that it's not possible to find the right knob-settings once and for all. A system upgrade or component failure may change the appropriate degree of data-parallelism. Changes in usable bandwidth may ask for a different partitioning of code among the client and server cluster. Moreover, an application may go through distinct phases during its execution. We should checkpoint the application for fault-tolerance less often during those phases in which checkpointing takes longer. Finally, the system needs to effectively allocate resources to concurrent applications, which can start at any time and which benefit differently from having these resources. In summary, we argue that in the future a significant fraction of computing will happen on architectures like Figure 1, and that, due to the architectures' inherent complexity, high availability and fast turnaround can only be realized by dynamically tuning a number of system parameters.Our position is that this tuning should be provided automatically by the system. The contrasting, application-specific view, contends that, to the extent possible, policies should be made by applications since they can make more informed optimizations. However, this requires a great deal of sophistication from the programmer. Further, it requires programmer time, one of the most scarce resources in systems building today.Toward our goal, we contribute a framework that is sufficiently rich to express a variety of interesting data-parallel applications, but which is also restricted enough so that the system can tune itself. These applications are built atop the ABACUS migration system, whose object placement algorithms are extended to reason about how many nodes should participate in a data-parallel computation, how to split up application objects among a client and server cluster, how often program state should be checkpointed, and the interaction (sometimes conflicting) between these questions. By automatically determining a number of critical parameters at runtime, we are minimizing the management costs which have in recent years given system administrators the howling fantods [Satyanarayanan 1999].
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
|
| |
2
|
{Amiri et al. 2000} Amiri, K., Petrou, D., Ganger, G. R., and Gibson, G. A. Dynamic function placement for data-intensive cluster computing. In Proceedings of the USENIX 2000 Annual Technical Conference, San Diego, CA, June 2000. Available at http://www.cs.cmu.edu/~dpetrou/research.html.
|
| |
3
|
{Anderson 1992} Anderson, T. E. The case for application-specific operating systems. In Proceedings of the Third Workshop on Workstation Operating Systems, Key Biscayne, FL, April 1992.
|
 |
4
|
Remzi H. Arpaci-Dusseau , Eric Anderson , Noah Treuhaft , David E. Culler , Joseph M. Hellerstein , David Patterson , Kathy Yelick, Cluster I/O with River: making the fast case common, Proceedings of the sixth workshop on I/O in parallel and distributed systems, p.10-22, May 05-05, 1999, Atlanta, Georgia, United States
[doi> 10.1145/301816.301823]
|
| |
5
|
{Carriero et al. 1993} Carriero, N., Gelernter, D., Kaminsky, D., and Westbrook, J. Adaptive parallelism with Piranha. Technical Report 954, Yale University Department of Computer Science, February 1993.
|
| |
6
|
{Fayyad 1998} Fayyad, U. Taming the giants and the monsters: Mining large databases for nuggets of knowledge. Database Programming and Design, 11(3), March 1998.
|
 |
7
|
Garth A. Gibson , David F. Nagle , Khalil Amiri , Jeff Butler , Fay W. Chang , Howard Gobioff , Charles Hardin , Erik Riedel , David Rochberg , Jim Zelenka, A cost-effective, high-bandwidth storage architecture, Proceedings of the eighth international conference on Architectural support for programming languages and operating systems, p.92-103, October 02-07, 1998, San Jose, California, United States
|
| |
8
|
{Gray 1997} Gray, J. Building petabyte databases and storage metrics, March 5 1997. Talk given at Carnegie Mellon University, available from http://research.microsoft.com/~gray/.
|
 |
9
|
|
 |
10
|
|
| |
11
|
M. Frans Kaashoek , Raymond Michiels , Henri E. Bal , Andrew S. Tanenbaum, Transparent fault-tolerance in parallel Orca programs, Papers from the symposium on Experiences with distributed and multiprocessor systems, p.297-311, January 1992, Newport Beach, California, United States
|
 |
12
|
David A. Patterson , Garth Gibson , Randy H. Katz, A case for redundant arrays of inexpensive disks (RAID), Proceedings of the 1988 ACM SIGMOD international conference on Management of data, p.109-116, June 01-03, 1988, Chicago, Illinois, United States
|
| |
13
|
|
| |
14
|
|
| |
15
|
{Satyanarayanan 1999} Satyanarayanan, M. Digest of Proceedings, Seventh IEEE Workshop on Hot Topics in Operating Systems, March 1999.
|
| |
16
|
{Schneider 1983} Schneider, F. B. Fail-stop processors. In Proceedings of the IEEE Spring COMPCON, pp. 66-70, San Francisco, CA, March 1983.
|
| |
17
|
|
 |
18
|
|
 |
19
|
|
Peer to Peer - Readers of this Article have also read:
-
Data structures for quadtree approximation and compression
Communications of the ACM
28, 9
Hanan Samet
-
A hierarchical single-key-lock access control using the Chinese remainder theorem
Proceedings of the 1992 ACM/SIGAPP Symposium on Applied computing
Kim S. Lee
, Huizhu Lu
, D. D. Fisher
-
The GemStone object database management system
Communications of the ACM
34, 10
Paul Butterworth
, Allen Otis
, Jacob Stein
-
Putting innovation to work: adoption strategies for multimedia communication systems
Communications of the ACM
34, 12
Ellen Francik
, Susan Ehrlich Rudman
, Donna Cooper
, Stephen Levine
-
An intelligent component database for behavioral synthesis
Proceedings of the 27th ACM/IEEE conference on Design automation
Gwo-Dong Chen
, Daniel D. Gajski
|