|
|||||||||||||||||||||||||
|
|||||||||||||||||||||||||
ABSTRACT
Due to disk and network latencies, I/O performance remains a major bottleneck for HPC on large datasets. As an important I/O optimization technique, prefetching and caching are widely employed in modern file systems to speed up data access. However, they are optimized for sequential locality and not usually effective for volumetric scientific data retrieval because spatial locality in query does not correspond to proximity in linear storage. Also, a prefetching policy must not only be efficient, having low overhead, but must also be effective, choosing the correct blocks to prefetch. The information about the future access pattern can help improve the effectiveness of prefetching and cache hit rate. Profile or hint based prefetching assumes that the complete access pattern is not available in advance and relies on certain prediction models to guess what to preload, which is error prone. In fact, many applications such as FFT and ray casting have regular access patterns that are known a priori and thus can be exploited to improve cache performance. Data parallelization is an effective way to accelerate the processing of massive data. It entails efficient data partitioning among processing elements and correct handling of dependencies to minimize the job extraction cost and communication overhead. During this process, knowledge about the data dependency in an application is crucial since it not only lays the basis to realize coarse-grained data partitioning with minimal communication cost through proper agglomeration algorithms, but also imposes constraints on the computation order over a data volume and thus is a key factor to formulate the access pattern used by the prefetching and caching mechanism to improve I/O performance. Most dependency detection models employ profile-based methods since exact analysis is difficult due to indirect pointers and aliases. However, the output of a profile-based dependency analysis is specific to the given input dataset and unsafe to serve as a general guide for the parallelization of all datasets. To realize the benefit of data parallelism with good I/O performance in scientific computation that centers on large multidimensional datasets, we propose an iteration and dependency aware data parallelization scheme for spatial computation on computing clusters. Built upon Granite's iteration aware spatial prefetching and caching techniques [1], this data parallelization scheme takes as input an explicit specification of the data dependency, identifies the best feasible access patterns and then wraps them in separate spatial data iterators for efficient cache loading and data partitioning respectively. Our data parallelization scheme makes following contributions. First, Granite maintains a spatial view of both data storage and iteration through the data. An access pattern is represented as an iterator that contains information such as the iteration space, iteration block shape, iteration directions and steps. Given any such spatial iterator and certain amount of memory, Granite can create an n-dimensional prefetching cache block that requires the lease number of disk reads or network transactions. Cache reloading is also avoided. These features make Granite particularly effective for efficient processing of spatial data. When applied to cluster computation, Granite can perform aggregated I/O for all participant compute nodes through collective caching [2]. Second, explicit dependency specification excludes the inaccuracy of profile-based dependency detection, making the data parallelization deadlock safe. It imposes a legal application start scope over the data volume, narrowing the choice of iterator for cache loading, and facilitates the quantification of communication cost, determining the choice of iterator for job partitioning. Lastly, our data parallelization scheme prioritizes but reconciles the I/O costs in the different stages of a typical cluster application to achieve the overall best I/O performance. And it can be done automatically with just a few inputs, saving scientists from programming miniatures. REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
|
|||||||||||||||||||||||||