ABSTRACT
Datacenter operators rely on low-cost, high-density technologies to maximize throughput for data-intensive services with tight tail latencies. In-memory rack-scale computing is emerging as a promising paradigm in scale-out datacenters, capitalizing on commodity SoCs, low-latency and high-bandwidth communication fabrics, and a remote memory access model to aggregate a rack's memory for critical data-intensive applications such as graph processing or key-value stores. Low latency and high bandwidth dictate not only eliminating communication bottlenecks in the software protocols and off-chip fabrics but also carefully integrating network interfaces on chip. The latter is a key challenge, especially in architectures with RDMA-inspired one-sided operations that aim to achieve low latency and high bandwidth through on-chip Network Interface (NI) support. This paper proposes and evaluates network interface architectures for tiled manycore SoCs for in-memory rack-scale computing. Our results indicate that carefully splitting NI functionality between the individual chip tiles and the chip's edge along a NOC dimension enables a rack-scale architecture to optimize for both latency and bandwidth. Our best manycore NI architecture achieves latencies within 3% of an idealized hardware NUMA and efficiently uses the full bisection bandwidth of the NOC, without changing the on-chip coherence protocol or the core's microarchitecture.
Manycore network interfaces for in-memory rack-scale computing (ISCA'15)