research-article

Manycore network interfaces for in-memory rack-scale computing

Authors:
Alexandros Daglis (EcoCloud, EPFL)
Stanko Novaković (EcoCloud, EPFL)
Edouard Bugnion (EcoCloud, EPFL)
Babak Falsafi (EcoCloud, EPFL)
Boris Grot (University of Edinburgh)

ISCA '15: Proceedings of the 42nd Annual International Symposium on Computer Architecture, June 2015, Pages 567–579. https://doi.org/10.1145/2749469.2750415

Published: 13 June 2015

ABSTRACT

Datacenter operators rely on low-cost, high-density technologies to maximize throughput for data-intensive services with tight tail latencies. In-memory rack-scale computing is emerging as a promising paradigm in scale-out datacenters, capitalizing on commodity SoCs, low-latency and high-bandwidth communication fabrics, and a remote memory access model to aggregate a rack's memory for critical data-intensive applications such as graph processing or key-value stores. Achieving low latency and high bandwidth requires not only eliminating communication bottlenecks in the software protocols and off-chip fabrics but also carefully integrating network interfaces on chip. The latter is a key challenge, especially in architectures with RDMA-inspired one-sided operations that aim to achieve low latency and high bandwidth through on-chip Network Interface (NI) support. This paper proposes and evaluates network interface architectures for tiled manycore SoCs for in-memory rack-scale computing. Our results indicate that carefully splitting NI functionality between the individual chip tiles and the chip's edge along a NOC dimension enables a rack-scale architecture to optimize for both latency and bandwidth. Our best manycore NI architecture achieves latencies within 3% of an idealized hardware NUMA and efficiently uses the full bisection bandwidth of the NOC, without changing the on-chip coherence protocol or the core's microarchitecture.
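
To make the term "one-sided operation" in the abstract concrete, the following is a minimal Python sketch of the general idea: a remote read is serviced entirely by the destination node's network interface, with no software handler running on the destination's cores. It is a toy model, not the architecture evaluated in the paper. The names `Node`, `Rack`, `ni_service_read`, `remote_read`, and `cpu_handler_calls` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy rack node (hypothetical model): local memory plus a CPU-handler counter."""
    memory: dict = field(default_factory=dict)  # address -> value
    cpu_handler_calls: int = 0                  # never incremented by one-sided ops

    def ni_service_read(self, addr):
        # The NI reads local memory directly; no software handler is invoked.
        return self.memory[addr]


@dataclass
class Rack:
    """Toy rack (hypothetical model): a mapping from node id to Node."""
    nodes: dict = field(default_factory=dict)

    def remote_read(self, dst_node, addr):
        # One-sided read: only the destination node's NI takes part.
        return self.nodes[dst_node].ni_service_read(addr)


# Node 0 reads address 0x40 from node 1's memory.
rack = Rack({0: Node(), 1: Node(memory={0x40: "value"})})
result = rack.remote_read(dst_node=1, addr=0x40)
```

In this sketch the destination's `cpu_handler_calls` counter stays at zero after the read, which is the property that distinguishes a one-sided access from a message that a remote core must receive and answer.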

Index Terms

1. Hardware

Published in

ISCA '15: Proceedings of the 42nd Annual International Symposium on Computer Architecture
June 2015, 768 pages
ISBN: 9781450334020
DOI: 10.1145/2749469
General Chair: Debbie Marr (Intel)
Program Chair: David Albonesi (Cornell)

ACM SIGARCH Computer Architecture News, Volume 43, Issue 3S (ISCA'15)
June 2015, 745 pages
ISSN: 0163-5964
DOI: 10.1145/2872887
Editor: Doug DeGroot

Copyright © 2015 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Acceptance Rates
Overall Acceptance Rate: 543 of 3,203 submissions, 17%