ABSTRACT
We present FGen, a program generator for high performance convolution operations (finite-impulse-response filters). The generator uses an internal mathematical DSL to enable structural optimization at a high level of abstraction. We use FGen as a testbed to demonstrate how to provide modular and extensible support for modern SIMD vector architectures in a DSL-based generator. Specifically, we show how to combine staging and generic programming with type classes to abstract over both the data type (real or complex) and the target architecture (e.g., SSE or AVX) when mapping DSL expressions to C code with explicit vector intrinsics. Benchmarks shows that the generated code is highly competitive with commercial libraries.
- B. Aktemur, Y. Kameyama, O. Kiselyov, and C.-c. Shan. Shonan challenge for generative programming: short position paper. In Proc. Partial evaluation and program manipulation (PEPM), pages 147--154, 2013. Google ScholarDigital Library
- P. Bientinesi, J. A. Gunnels, M. E. Myers, E. Quintana-Orti, and R. van de Geijn. The science of deriving dense linear algebra algorithms. ACM Trans. on Mathematical Software, 31(1):1--26, 2005. Google ScholarDigital Library
- C. Burke and R. Hui. J for the APL programmer. SIGAPL APL Quote Quad, 27(1):11--17, Sept. 1996. Google ScholarDigital Library
- A. Cohen, S. Donadio, M. jesus Garzaran, C. Herrmann, and D. Padua. In search of a program generator to implement generic transformations for high-performance computing. In 1 st MetaOCaml Workshop (associated with GPCE, pages 166771--7, 2004.Google Scholar
- Z. DeVito, J. Hegarty, A. Aiken, P. Hanrahan, and J. Vitek. Terra: A multi-stage language for high-performance computing. SIGPLAN Not., 48(6):105--116, June 2013. Google ScholarDigital Library
- F. Franchetti, Y. Voronenko, and M. Püschel. Formal loop merging for signal transforms. In Programming Languages Design and Implementation (PLDI), pages 315--326, 2005. Google ScholarDigital Library
- M. Frigo. A fast Fourier transform compiler. In Proc. Programming Language Design and Implementation (PLDI), pages 169--180, 1999. Google ScholarDigital Library
- J. A. Gunnels, F. G. Gustavson, G. M. Henry, and R. A. van de Geijn. FLAME: Formal linear algebra methods environment. ACM Trans. on Mathematical Software, 27(4):422--455, 2001. Google ScholarDigital Library
- K. E. Iverson. Programming Language. John Wiley & Sons Inc, 1962. Google ScholarDigital Library
- G. Keller, M. M. Chakravarty, R. Leshchinskiy, S. Peyton Jones, and B. Lippmeier. Regular, shape-polymorphic, parallel arrays in Haskell. volume 45, pages 261--272, New York, NY, USA, Sept. 2010. ACM. Google ScholarDigital Library
- Kx Systems. K reference manual version 2.0. 1998.Google Scholar
- J. McGraw, S. Skedzielewski, S. Allan, and R. Oldehoeft. Sisal: Streams and iteration in a single assignment language: Reference manual version 1.2. 1998.Google Scholar
- G. Ofenbeck, T. Rompf, A. Stojanov, M. Odersky, and M. Püschel. Spiral in Scala: Towards the systematic construction of generators for performance libraries. In International Conference on Generative Programming: Concepts & Experiences (GPCE), pages 125--134, 2013. Google ScholarDigital Library
- M. Püschel, F. Franchetti, and Y. Voronenko. Encyclopedia of Parallel Computing, chapter Spiral. Springer, 2011. Google ScholarDigital Library
- M. Püschel, J. M. F. Moura, J. Johnson, D. Padua, M. Veloso, B. Singer, J. Xiong, F. Franchetti, A. Gacic, Y. Voronenko, K. Chen, R. W. Johnson, and N. Rizzolo. SPIRAL: Code generation for DSP transforms. Proceedings of the IEEE, special issue on "Program Generation, Optimization, and Adaptation", 93(2):232-- 275, 2005.Google ScholarCross Ref
- J. Ragan-Kelley, C. Barnes, A. Adams, S. Paris, F. Durand, and S. P. Amarasinghe. Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. In H.-J. Boehm and C. Flanagan, editors, PLDI, pages 519--530. ACM, 2013. Google ScholarDigital Library
- T. Rompf and M. Odersky. Lightweight modular staging: a pragmatic approach to runtime code generation and compiled dsls. Commun. ACM, 55(6):121--130, 2012. Google ScholarDigital Library
- T. Rompf, A. K. Sujeeth, N. Amin, K. Brown, V. Jovanovic, H. Lee, M. Jonnalagedda, K. Olukotun, and M. Odersky. Optimizing data structures in high-level programs. In Proc. Principles of programming languages (POPL), pages 497--510, 2013. Google ScholarDigital Library
- T. Rompf, A. K. Sujeeth, H. Lee, K. J. Brown, H. Chafi, M. Odersky, and K. Olukotun. Building-blocks for performance oriented DSLs. DSL, 2011.Google ScholarCross Ref
- S.-B. Scholz. Single assignment C -- efficient support for high-level array operations in a functional setting, 2003.Google Scholar
- A. Sinkarovs and S.-B. Scholz. Data layout inference for code vectorisation. In HPCS, pages 527--534. IEEE, 2013.Google ScholarCross Ref
- D. G. Spampinato and M. Püschel. A basic linear algebra compiler. In International Symposium on Code Generation and Optimization (CGO), 2014. Google ScholarDigital Library
- A. K. Sujeeth, H. Lee, K. J. Brown, T. Rompf, M. Wu, A. R. Atreya, M. Odersky, and K. Olukotun. OptiML: an implicitly parallel domain-specific language for machine learning. In Proceedings of the 28th International Conference on Machine Learning, ICML, 2011.Google Scholar
- A. K. Sujeeth, T. Rompf, K. J. Brown, H. Lee, H. Chafi, V. Popic, M. Wu, A. Prokopec, V. Jovanovic, M. Odersky, and K. Olukotun. Composition and reuse with compiled domain-specific languages. In G. Castagna, editor, ECOOP, volume 7920 of Lecture Notes in Computer Science, pages 52--78. Springer, 2013. Google ScholarDigital Library
- W. Taha and T. Sheard. Metaml and multi-stage programming with explicit annotations. Theor. Comput. Sci., 248(1-2):211--242, 2000. Google ScholarDigital Library
- V. Ureche, T. Rompf, A. K. Sujeeth, H. Chafi, and M. Odersky. StagedSAC: a case study in performance-oriented dsl development. In O. Kiselyov and S. Thompson, editors, PEPM, pages 73--82. ACM, 2012. Google ScholarDigital Library
- R. Whaley, A. Petitet, and J. Dongarra. Automated empirical optimization of software and the ATLAS project. Parallel Computing, 27(1-2):3--35, 2001.Google ScholarDigital Library
Index Terms
- Abstracting Vector Architectures in Library Generators: Case Study Convolution Filters
Recommendations
Out-of-order vector architectures
MICRO 30: Proceedings of the 30th annual ACM/IEEE international symposium on MicroarchitectureRegister renaming and out-of-order instruction issue are now commonly used in superscalar processors. These techniques can also be used to significant advantage in vector processors, as this paper shows. Performance is improved and available memory ...
Hygienic quasiquotation in scheme
Scheme '12: Proceedings of the 2012 Annual Workshop on Scheme and Functional ProgrammingQuasiquotation in Scheme is nearly ideal for implementing programs that generate other programs. These programs lack only the ability to generate fresh bound identifiers, as required to make such code-manipulating programs hygienic, but any Scheme ...
Surgical precision JIT compilers
PLDI '14Just-in-time (JIT) compilation of running programs provides more optimization opportunities than offline compilation. Modern JIT compilers, such as those in virtual machines like Oracle's HotSpot for Java or Google's V8 for JavaScript, rely on dynamic ...
Comments