Optimizing software cache performance of packet processing applications

Authors:

Binyu Zang

ACM SIGPLAN Notices, Volume 42, Issue 7

Pages 227 - 236

https://doi.org/10.1145/1273444.1254808

Published: 13 June 2007

Abstract

Network processors (NPs) are widely used in many types of networking equipment because of their high performance and flexibility. Most NPs use a software cache instead of a hardware cache because of chip area, cost, and power constraints. Programmers must therefore take full responsibility for software cache management, which is neither intuitive nor easy for most of them. Without effective use of the software cache, long memory access latency becomes a critical limiting factor for overall application performance. Prior techniques such as hardware multithreading, wide-word accesses, and packet access combination for caching have been applied to help programmers overcome this bottleneck. However, most of them do not fully exploit the characteristics of packet processing applications and often perform only intraprocedural optimizations. As a result, for some applications the binary code generated by those techniques performs worse than hand-tuned assembly. In this paper, we propose an algorithm comprising two techniques, Critical Path Based Analysis (CPBA) and Global Adaptive Localization (GAL), to optimize the software cache performance of packet processing applications. Packet processing applications usually have several hot paths, and CPBA tries to insert localization instructions according to the execution frequencies of those paths. For further optimization, GAL eliminates some redundant localization instructions through interprocedural analysis and optimization. We apply our algorithm to several representative applications. Experimental results show that it achieves an average speedup by a factor of 1.974.
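The idea of placing localization according to hot-path execution frequency can be illustrated with a toy cost model. The sketch below is not the paper's CPBA algorithm, which the abstract does not spell out. It assumes that "localization" means copying a packet field from slow memory into fast local storage once and then serving later accesses from the local copy. The latencies, the `Path` type, the path names, and the field names are all hypothetical.

```python
from dataclasses import dataclass

# Hypothetical latencies in cycles; these numbers are not from the paper.
SLOW_LATENCY = 100   # one access to off-chip packet memory
LOCAL_LATENCY = 3    # one access to fast local storage


@dataclass
class Path:
    name: str
    frequency: float   # fraction of packets that take this path
    accesses: dict     # field name -> number of accesses on this path


def expected_benefit(field, paths):
    """Expected cycles saved per packet if `field` is localized on every path."""
    saved = 0.0
    for p in paths:
        n = p.accesses.get(field, 0)
        if n == 0:
            continue
        cost_without = n * SLOW_LATENCY
        # One slow fetch to localize the field, then n fast local accesses.
        cost_with = SLOW_LATENCY + n * LOCAL_LATENCY
        saved += p.frequency * (cost_without - cost_with)
    return saved


def choose_localized_fields(paths, capacity):
    """Pick up to `capacity` fields with positive expected benefit, best first."""
    fields = {f for p in paths for f in p.accesses}
    ranked = sorted(fields, key=lambda f: (-expected_benefit(f, paths), f))
    return [f for f in ranked if expected_benefit(f, paths) > 0][:capacity]


# Hypothetical hot-path profile for a forwarding application.
paths = [
    Path("ipv4_forward", 0.90, {"dst_ip": 3, "ttl": 2}),
    Path("arp", 0.08, {"src_mac": 1, "dst_ip": 1}),
    Path("exception", 0.02, {"options": 4}),
]

selected = choose_localized_fields(paths, capacity=2)
```

In this model a field that is read only once on a rarely taken path (`src_mac` here) is never selected, because the localization fetch costs more than it saves. A field that is reused on the hottest path is ranked first. A real compiler pass would also need to account for where on each path the localization instruction is placed and for writes back to slow memory, which this sketch ignores.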

Index Terms

Optimizing software cache performance of packet processing applications
1. Software and its engineering
  1. Software notations and tools
    1. Compilers

Published In

ACM SIGPLAN Notices Volume 42, Issue 7

Proceedings of the 2007 LCTES conference

July 2007

241 pages

ISSN:0362-1340

EISSN:1558-1160

DOI:10.1145/1273444

LCTES '07: Proceedings of the 2007 ACM SIGPLAN/SIGBED conference on Languages, compilers, and tools for embedded systems
June 2007
258 pages
ISBN:9781595936325
DOI:10.1145/1254766
General Chair: Santosh Pande, Georgia Institute of Technology, USA
Program Chair: Zhiyuan Li, Purdue University, USA

Copyright © 2007 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States
