research-article

Lynx: Using OS and Hardware Support for Fast Fine-Grained Inter-Core Communication

Authors:

Konstantina Mitropoulou,

Vasileios Porpodas,

Xiaochun Zhang,

Timothy M. JonesAuthors Info & Claims

ICS '16: Proceedings of the 2016 International Conference on Supercomputing

Article No.: 18, Pages 1 - 12

https://doi.org/10.1145/2925426.2926274

Published: 01 June 2016 Publication History

Abstract

Designing high-performance software queues for fast intercore communication is challenging, but critical for maximising software parallelism. State-of-the-art single-producer / single-consumer queues for streaming applications contain multiple sections, requiring the producer and consumer to operate independently on different sections from each other. While these queues perform well for coarse-grained data transfers, they perform poorly in the fine-grained case.

This paper proposes Lynx, a novel SP/SC queue, specifically tuned for fine-grained communication. Lynx is built from the ground up, reducing the generated code on the critical-path to just two operations per enqueue and dequeue. To achieve this it relies on existing commodity processor hardware and operating system exception handling support to deal with infrequent queue maintenance operations. Lynx outperforms the state-of-the art by up to 1.57x in total 64-bit throughput reaching a peak throughput of 15.7GB/s on a common desktop system. Real applications using Lynx get a performance improvement of up to 1.4x.

References

[1]

GCC: Gnu Compiler Collection. http://gcc.gnu.org.

[2]

NAS Parallel Benchmarks. http://www.nas.nasa.gov/publications/npb.html.

[3]

The LLVM Compiler Infrastructure. http://llvm.org.

[4]

Intel 64 and IA-32 Architectures Software Developer's Manual. 2015.

[5]

M. Abadi, T. Harris, and M. Mehrara. Transactional Memory with Strong Atomicity Using Off-the-shelf Memory Protection Hardware. In Proceedings of the 14th Symposium on Principles and Practice of Parallel Programming (PPoPP), 2009.

Digital Library

[6]

A. W. Appel and K. Li. Virtual Memory Primitives for User Programs. In Proceedings of the 4th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS IV), 1991.

Digital Library

[7]

K. Gharachorloo and P. B. Gibbons. Detecting Violations of Sequential Consistency. In Proceedings of the 3rd Annual Symposium on Parallel Algorithms and Architectures (SPAA), 1991.

Digital Library

[8]

J. Giacomoni, T. Moseley, and M. Vachharajani. Fast-Forward for Efficient Pipeline Parallelism: A Cache- Optimized Concurrent Lock-free Queue. In Proceedings of the 13th Symposium on Principles and Practice of Parallel Programming (PPoPP), 2008.

Digital Library

[9]

M. Girkar and C. D. Polychronopoulos. Automatic Extraction of Functional Parallelism from Ordinary Programs. Transactions on Parallel and Distributed Systems, 3(2), 1992.

Digital Library

[10]

M. P. Herlihy and J. M. Wing. Linearizability: A Correctness Condition for Concurrent Objects. Transactions on Programming Languages and Systems, 12(3), 1990.

Digital Library

[11]

T. B. Jablin, Y. Zhang, J. A. Jablin, J. Huang, H. Kim, and D. I. August. Liberty Queues for EPIC Architectures. In Proceedings of EPIC Workshop, 2010.

[12]

M. Kim, H. Kim, and C.-K. Luk. SD3: A Scalable Approach to Dynamic Data-Dependence Profiling. In Proceedings of the 43rd Annual International Symposium on Microarchitecture (MICRO 43), 2010.

Digital Library

[13]

E. Ladan-Mozes and N. Shavit. An Optimistic Approach to Lock-Free FIFO Queues. Distributed Computing, 20(5), 2007.

[14]

L. Lamport. Specifying Concurrent Program Modules. Transactions on Programming Languages and Systems, 5(2), 1983.

Digital Library

[15]

P. P. C. Lee, T. Bu, and G. Chandranmenon. A Lock-Free, Cache-Efficient Shared Ring Buffer for Multi-Core Architectures. In Proceedings of the 5th Symposium on Architectures for Networking and Communications Systems (ANCS), 2009.

Digital Library

[16]

P. P. C. Lee, T. Bu, and G. Chandranmenon. A Lock-Free, Cache-Efficient Multi-Core Synchronization Mechanism for Line-Rate Network Traffic Monitoring. In International Parallel and Distributed Processing Symposium (IPDPS), 2010.

[17]

S. Lee, D. Tiwari, Y. Solihin, and J. Tuck. HAQu: Hardware-Accelerated Queueing for Fine-Grained Threading on a Chip Multiprocessor. In 17th International Symposium on High Performance Computer Architecture (HPCA), 2011.

Digital Library

[18]

M. M. Michael and M. L. Scott. Nonblocking Algorithms and Preemption-Safe Locking on Multiprogrammed Shared Memory Multiprocessors. Journal of Parallel and Distributed Computing, 51(1), 1998.

Digital Library

[19]

M. Moir, D. Nussbaum, O. Shalev, and N. Shavit. Using Elimination to Implement Scalable and Lock-Free FIFO Queues. In Proceedings of the 17th Annual Symposium on Parallelism in Algorithms and Architectures (SPAA), 2005.

Digital Library

[20]

G. Ottoni, R. Rangan, A. Stoler, and D. I. August. Automatic Thread Extraction with Decoupled Software Pipelining. In Proceedings of the 38th Annual International Symposium on Microarchitecture (MICRO 38), 2005.

Digital Library

[21]

S. Prakash, Y. H. Lee, and T. Johnson. A Nonblocking Algorithm for Shared Queues Using Compare-and-Swap. Transactions on Computers, 43(5), 1994.

Digital Library

[22]

W. N. Scherer, III, D. Lea, and M. L. Scott. Scalable Synchronous Queues. In Proceedings of the 11th Symposium on Principles and Practice of Parallel Programming (PPoPP), 2006.

Digital Library

[23]

P. Tsigas and Y. Zhang. A Simple, Fast and Scalable Non-blocking Concurrent FIFO Queue for Shared Memory Multiprocessor Systems. In Proceedings of the 13th Annual Symposium on Parallel Algorithms and Architectures (SPAA), 2001.

Digital Library

[24]

C. Wang, H. s. Kim, Y. Wu, and V. Ying. Compiler-Managed Software-based Redundant Multi-Threading for Transient Fault Detection. In International Symposium on Code Generation and Optimization (CGO), 2007.

Digital Library

[25]

C. C. Wang and Y. Wu. Apparatus and Method for Redundant Software Thread Computation. 2010. US Patent 7,818,744.

[26]

J. Wang, K. Zhang, X. Tang, and B. Hua. B-Queue: Efficient and Practical Queuing for Fast Core-to-Core Communication. International Journal of Parallel Programming, 41(1), 2012.

[27]

E. Witchel, J. Cates, and K. Asanović. Mondrian memory protection. In Proceedings of the 10th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS X), 2002.

Digital Library

[28]

Y. Zhang, J. W. Lee, N. P. Johnson, and D. I. August. DAFT: Decoupled Acyclic Fault Tolerance. In Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques (PACT), 2010.

Digital Library

Cited By

Narayanan VDetweiler DHuang TBurtsev AFedorova ANarayanan DDi Luna GQuerzoni L(2023)DRAMHiT: A Hash Table Architected for the Speed of DRAMProceedings of the Eighteenth European Conference on Computer Systems10.1145/3552326.3587457(817-834)Online publication date: 8-May-2023
https://dl.acm.org/doi/10.1145/3552326.3587457
Nookala PDinda PHale KChard KRaicu I(2021)Enabling Extremely Fine-grained Parallelism via Scalable Concurrent Queues on Modern Many-core Architectures2021 29th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS)10.1109/MASCOTS53633.2021.9614292(1-8)Online publication date: 3-Nov-2021
https://doi.org/10.1109/MASCOTS53633.2021.9614292
Liu CLuo LLi MLei PChen LXiao K(2021)Inter-Core Communication Mechanisms for Microkernel Operating System based on Signal Transmission and Shared Memory2021 7th International Symposium on System and Software Reliability (ISSSR)10.1109/ISSSR53171.2021.00031(188-197)Online publication date: Sep-2021
https://doi.org/10.1109/ISSSR53171.2021.00031
Show More Cited By

Recommendations

Lynx: A SmartNIC-driven Accelerator-centric Architecture for Network Servers
ASPLOS '20: Proceedings of the Twenty-Fifth International Conference on Architectural Support for Programming Languages and Operating Systems

This paper explores new opportunities afforded by the growing deployment of compute and I/O accelerators to improve the performance and efficiency of hardware-accelerated computing services in data centers.

We propose Lynx, an accelerator-centric ...
Kilo TM: Hardware Transactional Memory for GPU Architectures

Programming GPUs is challenging for applications with irregular fine-grained communication between threads. To improve the programmability of GPUs and thus extend their usage to a wider range of applications, the authors propose to enable transactional ...
Lynx: A dynamic instrumentation system for data-parallel applications on GPGPU architectures
ISPASS '12: Proceedings of the 2012 IEEE International Symposium on Performance Analysis of Systems & Software

As parallel execution platforms continue to proliferate, there is a growing need for real-time introspection tools to provide insight into platform behavior for performance debugging, correctness checks, and to drive effective resource management ...

Comments

Information & Contributors

Information

Published In

cover image ACM Conferences

ICS '16: Proceedings of the 2016 International Conference on Supercomputing

June 2016

547 pages

ISBN:9781450343619

DOI:10.1145/2925426

Copyright © 2016 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGARCH: ACM Special Interest Group on Computer Architecture

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 June 2016

Permissions

Request permissions for this article.

Request Permissions

Check for updates

Author Tags

Qualifiers

Research-article
Research
Refereed limited

Conference

ICS '16

Sponsor:

SIGARCH

ICS '16: 2016 International Conference on Supercomputing

June 1 - 3, 2016

Istanbul, Turkey

Acceptance Rates

Overall Acceptance Rate 629 of 2,180 submissions, 29%

Contributors

Other Metrics

View Article Metrics

Bibliometrics & Citations

Bibliometrics

Article Metrics

11
Total Citations
View Citations
218
Total Downloads

Downloads (Last 12 months)7
Downloads (Last 6 weeks)1

Reflects downloads up to 09 Mar 2025

Other Metrics

View Author Metrics

Citations

Cited By

Narayanan VDetweiler DHuang TBurtsev AFedorova ANarayanan DDi Luna GQuerzoni L(2023)DRAMHiT: A Hash Table Architected for the Speed of DRAMProceedings of the Eighteenth European Conference on Computer Systems10.1145/3552326.3587457(817-834)Online publication date: 8-May-2023
https://dl.acm.org/doi/10.1145/3552326.3587457
Nookala PDinda PHale KChard KRaicu I(2021)Enabling Extremely Fine-grained Parallelism via Scalable Concurrent Queues on Modern Many-core Architectures2021 29th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS)10.1109/MASCOTS53633.2021.9614292(1-8)Online publication date: 3-Nov-2021
https://doi.org/10.1109/MASCOTS53633.2021.9614292
Liu CLuo LLi MLei PChen LXiao K(2021)Inter-Core Communication Mechanisms for Microkernel Operating System based on Signal Transmission and Shared Memory2021 7th International Symposium on System and Software Reliability (ISSSR)10.1109/ISSSR53171.2021.00031(188-197)Online publication date: Sep-2021
https://doi.org/10.1109/ISSSR53171.2021.00031
Liu CWang P(2021)A Concurrent Message Processing Mechanism for Internet of Vehicles2021 IEEE 2nd International Conference on Information Technology, Big Data and Artificial Intelligence (ICIBA)10.1109/ICIBA52610.2021.9687947(778-783)Online publication date: 17-Dec-2021
https://doi.org/10.1109/ICIBA52610.2021.9687947
Pérez DRopars TMeneses E(2021)On the Detection of Silent Data Corruptions in HPC Applications Using Redundant Multi-threadingEuro-Par 2020: Parallel Processing Workshops10.1007/978-3-030-71593-9_23(290-302)Online publication date: 14-Mar-2021
https://doi.org/10.1007/978-3-030-71593-9_23
Wang JTian YFu X(2020)EQueue: Elastic Lock-Free FIFO Queue for Core-to-Core Communication on Multi-Core ProcessorsIEEE Access10.1109/ACCESS.2020.29970718(98729-98741)Online publication date: 2020
https://doi.org/10.1109/ACCESS.2020.2997071
Maffione VLettieri GRizzo L(2018)Cache‐aware design of general‐purpose Single‐Producer–Single‐Consumer queuesSoftware: Practice and Experience10.1002/spe.267549:5(748-779)Online publication date: 26-Dec-2018
https://doi.org/10.1002/spe.2675
Wang JTian YLi TFu X(2017)A Flexible Communication Mechanism for Pipeline Parallelism2017 IEEE International Symposium on Parallel and Distributed Processing with Applications and 2017 IEEE International Conference on Ubiquitous Computing and Communications (ISPA/IUCC)10.1109/ISPA/IUCC.2017.00119(778-785)Online publication date: Dec-2017
https://doi.org/10.1109/ISPA/IUCC.2017.00119
Arnautov SFelber PFetzer CTrach B(2017)FFQ: A Fast Single-Producer/Multiple-Consumer Concurrent FIFO Queue2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)10.1109/IPDPS.2017.41(907-916)Online publication date: May-2017
https://doi.org/10.1109/IPDPS.2017.41
Meng XZeng XChen XYe X(2017)A cache-friendly concurrent lock-free queue for efficient inter-core communication2017 IEEE 9th International Conference on Communication Software and Networks (ICCSN)10.1109/ICCSN.2017.8230170(538-542)Online publication date: May-2017
https://doi.org/10.1109/ICCSN.2017.8230170
Show More Cited By

View Options

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

View options

PDF

View or Download as a PDF file.

eReader

View online with eReader.

Figures

Tables

Media

View Table of Conten