skip to main content
10.1145/2925426.2926274acmconferencesArticle/Chapter ViewAbstractPublication PagesicsConference Proceedingsconference-collections
research-article

Lynx: Using OS and Hardware Support for Fast Fine-Grained Inter-Core Communication

Published: 01 June 2016 Publication History

Abstract

Designing high-performance software queues for fast intercore communication is challenging, but critical for maximising software parallelism. State-of-the-art single-producer / single-consumer queues for streaming applications contain multiple sections, requiring the producer and consumer to operate independently on different sections from each other. While these queues perform well for coarse-grained data transfers, they perform poorly in the fine-grained case.
This paper proposes Lynx, a novel SP/SC queue, specifically tuned for fine-grained communication. Lynx is built from the ground up, reducing the generated code on the critical-path to just two operations per enqueue and dequeue. To achieve this it relies on existing commodity processor hardware and operating system exception handling support to deal with infrequent queue maintenance operations. Lynx outperforms the state-of-the art by up to 1.57x in total 64-bit throughput reaching a peak throughput of 15.7GB/s on a common desktop system. Real applications using Lynx get a performance improvement of up to 1.4x.

References

[1]
GCC: Gnu Compiler Collection. http://gcc.gnu.org.
[2]
NAS Parallel Benchmarks. http://www.nas.nasa.gov/publications/npb.html.
[3]
The LLVM Compiler Infrastructure. http://llvm.org.
[4]
Intel 64 and IA-32 Architectures Software Developer's Manual. 2015.
[5]
M. Abadi, T. Harris, and M. Mehrara. Transactional Memory with Strong Atomicity Using Off-the-shelf Memory Protection Hardware. In Proceedings of the 14th Symposium on Principles and Practice of Parallel Programming (PPoPP), 2009.
[6]
A. W. Appel and K. Li. Virtual Memory Primitives for User Programs. In Proceedings of the 4th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS IV), 1991.
[7]
K. Gharachorloo and P. B. Gibbons. Detecting Violations of Sequential Consistency. In Proceedings of the 3rd Annual Symposium on Parallel Algorithms and Architectures (SPAA), 1991.
[8]
J. Giacomoni, T. Moseley, and M. Vachharajani. Fast-Forward for Efficient Pipeline Parallelism: A Cache- Optimized Concurrent Lock-free Queue. In Proceedings of the 13th Symposium on Principles and Practice of Parallel Programming (PPoPP), 2008.
[9]
M. Girkar and C. D. Polychronopoulos. Automatic Extraction of Functional Parallelism from Ordinary Programs. Transactions on Parallel and Distributed Systems, 3(2), 1992.
[10]
M. P. Herlihy and J. M. Wing. Linearizability: A Correctness Condition for Concurrent Objects. Transactions on Programming Languages and Systems, 12(3), 1990.
[11]
T. B. Jablin, Y. Zhang, J. A. Jablin, J. Huang, H. Kim, and D. I. August. Liberty Queues for EPIC Architectures. In Proceedings of EPIC Workshop, 2010.
[12]
M. Kim, H. Kim, and C.-K. Luk. SD3: A Scalable Approach to Dynamic Data-Dependence Profiling. In Proceedings of the 43rd Annual International Symposium on Microarchitecture (MICRO 43), 2010.
[13]
E. Ladan-Mozes and N. Shavit. An Optimistic Approach to Lock-Free FIFO Queues. Distributed Computing, 20(5), 2007.
[14]
L. Lamport. Specifying Concurrent Program Modules. Transactions on Programming Languages and Systems, 5(2), 1983.
[15]
P. P. C. Lee, T. Bu, and G. Chandranmenon. A Lock-Free, Cache-Efficient Shared Ring Buffer for Multi-Core Architectures. In Proceedings of the 5th Symposium on Architectures for Networking and Communications Systems (ANCS), 2009.
[16]
P. P. C. Lee, T. Bu, and G. Chandranmenon. A Lock-Free, Cache-Efficient Multi-Core Synchronization Mechanism for Line-Rate Network Traffic Monitoring. In International Parallel and Distributed Processing Symposium (IPDPS), 2010.
[17]
S. Lee, D. Tiwari, Y. Solihin, and J. Tuck. HAQu: Hardware-Accelerated Queueing for Fine-Grained Threading on a Chip Multiprocessor. In 17th International Symposium on High Performance Computer Architecture (HPCA), 2011.
[18]
M. M. Michael and M. L. Scott. Nonblocking Algorithms and Preemption-Safe Locking on Multiprogrammed Shared Memory Multiprocessors. Journal of Parallel and Distributed Computing, 51(1), 1998.
[19]
M. Moir, D. Nussbaum, O. Shalev, and N. Shavit. Using Elimination to Implement Scalable and Lock-Free FIFO Queues. In Proceedings of the 17th Annual Symposium on Parallelism in Algorithms and Architectures (SPAA), 2005.
[20]
G. Ottoni, R. Rangan, A. Stoler, and D. I. August. Automatic Thread Extraction with Decoupled Software Pipelining. In Proceedings of the 38th Annual International Symposium on Microarchitecture (MICRO 38), 2005.
[21]
S. Prakash, Y. H. Lee, and T. Johnson. A Nonblocking Algorithm for Shared Queues Using Compare-and-Swap. Transactions on Computers, 43(5), 1994.
[22]
W. N. Scherer, III, D. Lea, and M. L. Scott. Scalable Synchronous Queues. In Proceedings of the 11th Symposium on Principles and Practice of Parallel Programming (PPoPP), 2006.
[23]
P. Tsigas and Y. Zhang. A Simple, Fast and Scalable Non-blocking Concurrent FIFO Queue for Shared Memory Multiprocessor Systems. In Proceedings of the 13th Annual Symposium on Parallel Algorithms and Architectures (SPAA), 2001.
[24]
C. Wang, H. s. Kim, Y. Wu, and V. Ying. Compiler-Managed Software-based Redundant Multi-Threading for Transient Fault Detection. In International Symposium on Code Generation and Optimization (CGO), 2007.
[25]
C. C. Wang and Y. Wu. Apparatus and Method for Redundant Software Thread Computation. 2010. US Patent 7,818,744.
[26]
J. Wang, K. Zhang, X. Tang, and B. Hua. B-Queue: Efficient and Practical Queuing for Fast Core-to-Core Communication. International Journal of Parallel Programming, 41(1), 2012.
[27]
E. Witchel, J. Cates, and K. Asanović. Mondrian memory protection. In Proceedings of the 10th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS X), 2002.
[28]
Y. Zhang, J. W. Lee, N. P. Johnson, and D. I. August. DAFT: Decoupled Acyclic Fault Tolerance. In Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques (PACT), 2010.

Cited By

View all
  • (2023)DRAMHiT: A Hash Table Architected for the Speed of DRAMProceedings of the Eighteenth European Conference on Computer Systems10.1145/3552326.3587457(817-834)Online publication date: 8-May-2023
  • (2021)Enabling Extremely Fine-grained Parallelism via Scalable Concurrent Queues on Modern Many-core Architectures2021 29th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS)10.1109/MASCOTS53633.2021.9614292(1-8)Online publication date: 3-Nov-2021
  • (2021)Inter-Core Communication Mechanisms for Microkernel Operating System based on Signal Transmission and Shared Memory2021 7th International Symposium on System and Software Reliability (ISSSR)10.1109/ISSSR53171.2021.00031(188-197)Online publication date: Sep-2021
  • Show More Cited By

Recommendations

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences
ICS '16: Proceedings of the 2016 International Conference on Supercomputing
June 2016
547 pages
ISBN:9781450343619
DOI:10.1145/2925426
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 June 2016

Permissions

Request permissions for this article.

Check for updates

Author Tags

  1. Fine-grained Communication
  2. Hardware Exceptions
  3. Single-Producer / Single-Consumer Software Queue

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ICS '16
Sponsor:

Acceptance Rates

Overall Acceptance Rate 629 of 2,180 submissions, 29%

Contributors

Other Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months)7
  • Downloads (Last 6 weeks)1
Reflects downloads up to 09 Mar 2025

Other Metrics

Citations

Cited By

View all
  • (2023)DRAMHiT: A Hash Table Architected for the Speed of DRAMProceedings of the Eighteenth European Conference on Computer Systems10.1145/3552326.3587457(817-834)Online publication date: 8-May-2023
  • (2021)Enabling Extremely Fine-grained Parallelism via Scalable Concurrent Queues on Modern Many-core Architectures2021 29th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS)10.1109/MASCOTS53633.2021.9614292(1-8)Online publication date: 3-Nov-2021
  • (2021)Inter-Core Communication Mechanisms for Microkernel Operating System based on Signal Transmission and Shared Memory2021 7th International Symposium on System and Software Reliability (ISSSR)10.1109/ISSSR53171.2021.00031(188-197)Online publication date: Sep-2021
  • (2021)A Concurrent Message Processing Mechanism for Internet of Vehicles2021 IEEE 2nd International Conference on Information Technology, Big Data and Artificial Intelligence (ICIBA)10.1109/ICIBA52610.2021.9687947(778-783)Online publication date: 17-Dec-2021
  • (2021)On the Detection of Silent Data Corruptions in HPC Applications Using Redundant Multi-threadingEuro-Par 2020: Parallel Processing Workshops10.1007/978-3-030-71593-9_23(290-302)Online publication date: 14-Mar-2021
  • (2020)EQueue: Elastic Lock-Free FIFO Queue for Core-to-Core Communication on Multi-Core ProcessorsIEEE Access10.1109/ACCESS.2020.29970718(98729-98741)Online publication date: 2020
  • (2018)Cache‐aware design of general‐purpose Single‐Producer–Single‐Consumer queuesSoftware: Practice and Experience10.1002/spe.267549:5(748-779)Online publication date: 26-Dec-2018
  • (2017)A Flexible Communication Mechanism for Pipeline Parallelism2017 IEEE International Symposium on Parallel and Distributed Processing with Applications and 2017 IEEE International Conference on Ubiquitous Computing and Communications (ISPA/IUCC)10.1109/ISPA/IUCC.2017.00119(778-785)Online publication date: Dec-2017
  • (2017)FFQ: A Fast Single-Producer/Multiple-Consumer Concurrent FIFO Queue2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)10.1109/IPDPS.2017.41(907-916)Online publication date: May-2017
  • (2017)A cache-friendly concurrent lock-free queue for efficient inter-core communication2017 IEEE 9th International Conference on Communication Software and Networks (ICCSN)10.1109/ICCSN.2017.8230170(538-542)Online publication date: May-2017
  • Show More Cited By

View Options

Login options

View options

PDF

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Figures

Tables

Media

Share

Share

Share this Publication link

Share on social media