DOI: 10.1145/2749469.2750415
research-article

Manycore network interfaces for in-memory rack-scale computing

Published: 13 June 2015

ABSTRACT

Datacenter operators rely on low-cost, high-density technologies to maximize throughput for data-intensive services with tight tail latencies. In-memory rack-scale computing is emerging as a promising paradigm in scale-out datacenters, capitalizing on commodity SoCs, low-latency and high-bandwidth communication fabrics, and a remote memory access model to enable aggregation of a rack's memory for critical data-intensive applications such as graph processing or key-value stores. Low latency and high bandwidth dictate not only eliminating communication bottlenecks in the software protocols and off-chip fabrics but also carefully integrating network interfaces on chip. The latter is a key challenge, especially in architectures with RDMA-inspired one-sided operations that aim to achieve low latency and high bandwidth through on-chip Network Interface (NI) support. This paper proposes and evaluates network interface architectures for tiled manycore SoCs for in-memory rack-scale computing. Our results indicate that a careful splitting of NI functionality per chip tile and at the chip's edge along a NOC dimension enables a rack-scale architecture to optimize for both latency and bandwidth. Our best manycore NI architecture achieves latencies within 3% of an idealized hardware NUMA and efficiently uses the full bisection bandwidth of the NOC, without changing the on-chip coherence protocol or the core's microarchitecture.

Published in

ISCA '15: Proceedings of the 42nd Annual International Symposium on Computer Architecture
June 2015
768 pages
ISBN: 9781450334020
DOI: 10.1145/2749469

Copyright © 2015 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery, New York, NY, United States

Acceptance Rates

Overall Acceptance Rate: 543 of 3,203 submissions, 17%
