Data-Centric Computing Frontiers: A Survey On Processing-In-Memory

Authors:
Patrick Siegl
TU Braunschweig, Abteilung Technische Informatik, E.I.S., Braunschweig, Germany

Rainer Buchty
TU Braunschweig, Abteilung Technische Informatik, E.I.S., Braunschweig, Germany

Mladen Berekovic
TU Braunschweig, Abteilung Technische Informatik, E.I.S., Braunschweig, Germany

MEMSYS '16: Proceedings of the Second International Symposium on Memory Systems, October 2016, Pages 295–308, https://doi.org/10.1145/2989081.2989087

Published: 03 October 2016

ABSTRACT

A major shift from compute-centric to data-centric computing systems is under way, as novel big data workloads such as cognitive computing and machine learning demand embarrassingly parallel and highly efficient processor architectures. With Moore's law at its end, innovative architectural concepts and technologies are urgently required to open a path towards exascale and beyond, even though current computing systems face the instruction-level parallelism, power, memory, and bandwidth walls.

As part of any computing system, memories are generally perceived as unreliable, power-hungry, and slow, which makes them a prospective future bottleneck. This bottleneck stems from a pin limitation imposed by packaging constraints: a tremendous row bandwidth is available inside the memory but remains unexploited, because off-chip it diminishes to a bare minimum. Building upon the shift towards data-centric computing systems, the near-memory processing concept seems most promising, since power efficiency and computing performance increase when tasks are co-located on bandwidth-rich in-memory processing units, while data movement is reduced by avoiding entire memory hierarchies. Considering the umbrella of near-data processing as the urgently required breakthrough for future computing systems, this survey presents its derivations with a special emphasis on Processing-In-Memory (PIM), highlighting historical achievements in technology as well as architecture and depicting its advantages and obstacles.

Published in
MEMSYS '16: Proceedings of the Second International Symposium on Memory Systems
October 2016
463 pages
ISBN: 9781450343053
DOI: 10.1145/2989081
General Chair: Bruce Jacob, University of Maryland
Copyright © 2016 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
bandwidth wall
memory wall
near-data processing
processing-in-memory
Qualifiers
- research-article
- Research
- Refereed limited