No abstract available.
Analytical cache models with applications to cache partitioning
An accurate, tractable, analytic cache model for time-shared systems is presented, which estimates the overall cache miss-rate of a multiprocessing system with any cache size and time quanta. The input to the model consists of the isolated miss-rate ...
A synthesis of memory mechanisms for distributed architectures
Producing efficient parallel programs for distributed memory multiprocessors is a difficult task. Hand-coding efficient parallel programs for these systems can be extremely difficult, time consuming and error-prone, so people have turned to the shared ...
The trade-off between implicit and explicit data distribution in shared-memory programming paradigms
- Dimitrios S. Nikolopoulos,
- Eduard Ayguadé,
- Theodore S. Papatheodorou,
- Constantine D. Polychronopoulos,
- Jesús Labarta
This paper explores previously established and novel methods for scaling the performance of OpenMP on NUMA architectures. The spectrum of methods under investigation includes OS-level automatic page placement algorithms, dynamic page migration, and manual ...
Fractal symbolic analysis
Modern compilers perform wholesale restructuring of programs to improve their efficiency. Dependence analysis is the most widely used technique for proving the correctness of such transformations, but it suffers from the limitation that it considers ...
Data locality enhancement by memory reduction
In this paper, we propose memory reduction as a new approach to data locality enhancement. Under this approach, we use the compiler to reduce the size of the data repeatedly referenced in a collection of nested loops. Between their reuses, the data will ...
Eliminating redundancies in sum-of-product array computations
Array programming languages such as Fortran 90, High Performance Fortran and ZPL are well-suited to scientific computing because they free the scientist from the responsibility of managing burdensome low-level details that complicate programming in ...
Monotonic evolution: an alternative to induction variable substitution for dependence analysis
We present a new approach to dependence testing in the presence of induction variables. Instead of looking for closed form expressions, our method computes monotonic evolution which captures the direction in which the value of a variable changes. This ...
Optimizing strategies for telescoping languages: procedure strength reduction and procedure vectorization
At Rice University, we have undertaken a project to construct a framework for generating high-level problem solving languages that can achieve high performance on a variety of platforms. The underlying strategy, called telescoping languages, builds ...
Loop optimization for a class of memory-constrained computations
Compute-intensive multi-dimensional summations that involve products of several arrays arise in the modeling of electronic structure of materials. Sometimes several alternative formulations of a computation, representing different space-time trade-offs, ...
Fast parallel in-memory 64-bit sorting
Parallel in-memory 64-bit sorting is an important problem in Database Management Systems and other applications such as Internet Search Engines and Data Mining Tools.
We propose a new algorithm that we call Parallel Counting Split Radix sort, PCS-Radix ...
Optimizing locality for ODE solvers
Runge-Kutta methods are popular methods for the solution of systems of ordinary differential equations and are provided by many scientific libraries. The performance of Runge-Kutta methods does not only depend on the specific application problem to be ...
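As a point of reference for the entry above, here is a minimal sketch of one classical fourth-order Runge-Kutta step in plain Python. It is the textbook scheme, not the locality-optimized variant the paper studies; the names `rk4_step` and `integrate` are illustrative assumptions. Each of the four stages sweeps the full system vector, which is the access pattern a locality optimization would have to reorganize.

```python
import math
from typing import Callable, List

Vector = List[float]


def rk4_step(f: Callable[[float, Vector], Vector],
             t: float, y: Vector, h: float) -> Vector:
    """Advance y' = f(t, y) by one classical RK4 step of size h."""

    def shifted(scale: float, k: Vector) -> Vector:
        # Returns y + scale * k, element by element.
        return [yi + scale * ki for yi, ki in zip(y, k)]

    # Four stage evaluations; each one touches the whole vector.
    k1 = f(t, y)
    k2 = f(t + h / 2, shifted(h / 2, k1))
    k3 = f(t + h / 2, shifted(h / 2, k2))
    k4 = f(t + h, shifted(h, k3))

    # Weighted combination of the stages.
    return [
        yi + (h / 6) * (a + 2 * b + 2 * c + d)
        for yi, a, b, c, d in zip(y, k1, k2, k3, k4)
    ]


def integrate(f: Callable[[float, Vector], Vector],
              t0: float, y0: Vector, h: float, steps: int) -> Vector:
    """Apply rk4_step repeatedly, starting from (t0, y0)."""
    t, y = t0, list(y0)
    for _ in range(steps):
        y = rk4_step(f, t, y, h)
        t += h
    return y


# Test problem (an assumption for illustration): y' = -y, y(0) = 1,
# whose exact solution at t = 1 is exp(-1).
approx = integrate(lambda t, y: [-y[0]], 0.0, [1.0], 0.1, 10)
error = abs(approx[0] - math.exp(-1.0))
```

For this test problem the computed value agrees with the exact solution to well within 1e-5 at step size 0.1.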
Array language support for parallel sparse computation
This paper describes an array-based language-level approach to parallel sparse computation. Our approach is unique due to its separation of sparse index sets from arrays, both syntactically and in the implementation. This design allows users to express ...
A parallel algorithm for sparse symbolic LU factorization without pivoting on out-of-core matrices
Finding the nonzero structures of the lower and upper triangular factors of an unsymmetric sparse matrix A is an important problem in the field of sparse matrix computations. Complementing previous research on sequential algorithms, we develop a ...
Tools for application-oriented performance tuning
Application performance tuning is a complex process that requires assembling various types of information and correlating it with source code to pinpoint the causes of performance bottlenecks. Existing performance tools don't adequately support this ...
Global optimization techniques for automatic parallelization of hybrid applications
This paper presents a novel technique to perform global optimization of communication and preprocessing calls in the presence of array accesses with arbitrary subscripts. Our scheme is presented in the context of automatic parallelization of sequential ...
Tuning high-performance scientific codes: the use of performance models to control resource usage during data migration and I/O
Large-scale parallel simulations are a popular tool for investigating phenomena ranging from nuclear explosions to protein folding. These codes produce copious output that must be moved to the workstation where it will be visualized. Scientists have a ...
Computer aided hand tuning (CAHT): “applying case-based reasoning to performance tuning”
For most parallel and high performance systems, tuning guides provide users with advice for optimizing the execution time of their programs. Execution time may be very sensitive to small program changes. Such modifications may be local (on loop) or ...
Cache performance for multimedia applications
The caching behavior of multimedia applications has been described as having high instruction reference locality within small loops, very large working sets, and poor data cache performance due to non-locality of data references. Despite this, there is ...
On the potential of tolerant region reuse for multimedia applications
The recent years have shown an interesting evolution in the mid-end to low-end embedded domain. Portable systems are growing in importance as they improve in storage capacity and in interaction capabilities with general purpose systems. Furthermore, ...
Evaluation of processor code efficiency for embedded systems
- Morgan Hirosuke Miki,
- Mamoru Sakamoto,
- Shingo Miyamoto,
- Yoshinori Takeuchi,
- Toyohiko Yoshida,
- Isao Shirakawa
This paper evaluates the code efficiency of the ARM, Java, and x86 instruction sets by compiling the SPEC CPU95/CPU2000/JVM98 and CaffeineMark benchmarks, in terms of code sizes, basic block sizes, instruction distributions, and average instruction ...
Improving 3D geometry transformations on a simultaneous multithreaded SIMD processor
In this paper we evaluate the performance of an SMT processor used as the geometry processor for a 3D polygonal rendering engine. To evaluate this approach, we consider PMesa (a parallel version of Mesa) which parallelizes the geometry stage of the 3D ...
Bringing together automatic differentiation and OpenMP
Derivatives of almost arbitrary functions can be evaluated efficiently by automatic differentiation whenever the functions are given in the form of computer programs in a high-level programming language such as Fortran, C, or C++. Furthermore, in ...
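To illustrate what "automatic differentiation of a program" means in the entry above, the following is a minimal sketch of forward-mode differentiation with dual numbers in Python. It is a generic teaching example under stated assumptions, not the tool or the OpenMP integration the paper describes; `Dual`, `sin`, and `derivative` are hypothetical names introduced here.

```python
import math
from dataclasses import dataclass
from typing import Callable, Union

Number = Union[int, float, "Dual"]


@dataclass(frozen=True)
class Dual:
    """A value paired with its derivative with respect to the input."""
    val: float
    dot: float = 0.0

    def __add__(self, other: Number) -> "Dual":
        o = _lift(other)
        return Dual(self.val + o.val, self.dot + o.dot)

    __radd__ = __add__

    def __mul__(self, other: Number) -> "Dual":
        o = _lift(other)
        # Product rule: (uv)' = u'v + uv'.
        return Dual(self.val * o.val, self.dot * o.val + self.val * o.dot)

    __rmul__ = __mul__


def _lift(x: Number) -> Dual:
    # Constants carry a zero derivative.
    return x if isinstance(x, Dual) else Dual(float(x))


def sin(x: Dual) -> Dual:
    # Chain rule for an elementary function.
    return Dual(math.sin(x.val), math.cos(x.val) * x.dot)


def derivative(f: Callable[[Dual], Dual], x: float) -> float:
    """Evaluate f'(x) by seeding the input derivative with 1."""
    return f(Dual(x, 1.0)).dot


def f(x: Dual) -> Dual:
    # An ordinary program: x^3 + 2x + sin(x).
    return x * x * x + 2 * x + sin(x)


slope = derivative(f, 1.0)  # analytically 3 + 2 + cos(1)
```

The function `f` is written as ordinary arithmetic; the derivative falls out of operator overloading, with no symbolic manipulation and no finite-difference step size to choose.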
Automatic code generation for a turbulence scheme
In this paper we describe how to extend CTADEL, a Problem Solving Environment, in order to generate code for a turbulence scheme, in our case, within a numerical weather prediction model (NWP). Common for these schemes is the presence of implicit ...
Towards the effective parallel computation of matrix pseudospectra
Given a matrix A, the computation of its pseudospectrum Λ_ε(A) is a far more expensive task than the computation of characteristics such as the condition number and the matrix spectrum. As research of the last 15 years has shown, however, the matrix ...
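For readers unfamiliar with the term, the standard 2-norm definition of the ε-pseudospectrum is given below. It is the usual textbook formulation and is assumed, not confirmed by the truncated abstract, to be the one the paper uses.

```latex
\Lambda_\varepsilon(A)
  = \{\, z \in \mathbb{C} : \sigma_{\min}(zI - A) \le \varepsilon \,\}
  = \{\, z \in \mathbb{C} : \|(zI - A)^{-1}\|_2 \ge \varepsilon^{-1} \,\}
```

Under this definition, a direct evaluation needs the smallest singular value of zI − A at every point z of a grid over the region of interest, which is why the cost far exceeds that of a single spectrum or condition number computation.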
A graphical tool for driving the parallel computation of pseudospectra
This paper presents the programming environment of a new tool for the parallel computation of pseudospectra. Based on the PPAT algorithm described in [16, 17], this algorithm offers total reliability and can handle singularities along the level curve ...
Register-sensitive selection, duplication, and sequencing of instructions
In this paper, we present a new framework for selecting, duplicating and sequencing instructions so as to decrease register pressure. The motivation for this work is to target current and future high-performance processors where reductions in register ...
Load and store reuse using register file contents
The detection of opportunities for value reuse optimizations in memory operations requires both the addresses and values associated with these operations to be available. Although the values are typically available in the physical register file, their ...
Improving Gang Scheduling through job performance analysis and malleability
The OpenMP programming model provides parallel applications a very important feature: job malleability. Job malleability is the capacity of an application to dynamically adapt its parallelism to the number of processors allocated to it. We believe that ...
Reducing the complexity of the issue logic
The issue logic of dynamically scheduled superscalar processors is one of their most complex and power-consuming parts. In this paper we present alternative issue-logic designs that are much simpler than the traditional scheme while they retain most of ...
Slice-processors: an implementation of operation-based prediction
We describe the Slice Processor micro-architecture that implements a generalized operation-based prefetching mechanism. Operation-based prefetchers predict the series of operations, or the computation slice that can be used to calculate forthcoming ...
Index Terms
- Proceedings of the 15th international conference on Supercomputing