ACM Home Page
Please provide us with feedback. Feedback
Exploiting locality to ameliorate packet queue contention and serialization
Full text PdfPdf (411 KB)
Source Conference On Computing Frontiers archive
Proceedings of the 3rd conference on Computing frontiers table of contents
Ischia, Italy
SESSION: High performance architectures table of contents
Pages: 279 - 290  
Year of Publication: 2006
ISBN:1-59593-302-6
Authors
Sailesh Kumar  Washington University, St. Louis, MO
John Maschmeyer  Washington University, St. Louis, MO
Patrick Crowley  Washington University, St. Louis, MO
Sponsor
ACM: Association for Computing Machinery
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 3,   Downloads (12 Months): 24,   Citation Count: 3
Additional Information:

abstract   references   cited by   index terms   collaborative colleagues  

Tools and Actions: Review this Article  
Save this Article to a Binder    Display Formats: BibTex  EndNote ACM Ref   
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1128022.1128060
What is a DOI?

ABSTRACT

Packet processing systems maintain high throughput despite relatively high memory latencies by exploiting the coarse-grained parallelism available between packets. In particular, multiple processors are used to overlap the processing of multiple packets. Packet queuing-the fundamental mechanism enabling packet scheduling, differentiated services, and traffic isolation-requires a read-modify-write operation on a linked list data structure to enqueue and dequeue packets; this operation represents a potential serializing bottleneck. If all packets awaiting service are destined for different queues, these read-modify-write cycles can proceed in parallel. However, if all or many of the incoming packets are destined for the same queue, or for a small number of queues, then system throughput will be serialized by these sequential external memory operations. For this reason, low latency SRAMs are used to implement the queue data structures. This reduces the absolute cost of serialization but does not eliminate it; SRAM latencies determine system throughput.In this paper we observe that the worst-case scenario for packet queuing coincides with the best-case scenario for caches: i.e., when locality exists and the majority of packets are destined for a small number of queues. The main contribution of this work is the queuing cache, which consists of a hardware cache and a closely coupled queuing engine that implements queue operations. The queuing cache improves performance dramatically by moving the bottleneck from external memory onto the packet processor, where clock rates are higher and latencies are lower. We compare the queuing cache to a number of alternatives, specifically, SRAM controllers with: no queuing support, a software-controlled cache plus a queuing engine (like that used on Intel's IXP network processor), and a hardware cache. Relative to these models, we show that a queuing cache improves worst-case throughput by factors of 3.1, 1.5, and 2.1 and the throughput of real-world traffic traces by factors of 2.6, 1.3, and 1.75, respectively. We also show that the queuing cache decreases external memory bandwidth usage, on-chip communication, and the number of queuing instructions executed under best-case, worst-case and real-world traffic workloads. Based on our VHDL models, we conclude that a queuing cache could be implemented at a low cost relative to the resulting performance and efficiency benefits.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

 
1
M. Adiletta, et al. "The Next Generation of Intel IXP Network Processors," Intel Technology Journal, vol. 6, no 3, pp. 6--18, Aug 2002.
 
2
 
3
4
 
5
S. Iyer, R. R. Compella, and N. McKeown, "Designing Buffers for Router Line Cards," Stanford University Technical Report - TR02-HPNG-031001, Stanford, CA, 2002.
 
6
S. Iyer, R. R. Kompella, and N. McKeown, "Analysis of a Memory Architecture for Fast Packet Buffers," IEEE HPSR'02, Dallas, May 2001.
 
7
G. Shrimali and N. McKeown, "Statistical Guarantees for Packet Buffers: The Monolithic DRAM Case," Stanford University HPNG Technical Report - TR04-HPNG-020603, Stanford, CA, 2004.
8
 
9
 
10
Y. Joo and N. McKeown, "Doubling Memory Bandwidth for Network Buffers," Proc. IEEE Infocom 1998, vol. 2, pp. 808--815, San Francisco.
 
11
12
 
13
A. Nikologiannis, M. Katevenis, "Efficient Per-Flow Queuing in DRAM at OC-192 Line Rate using Out-of-Order Execution Techniques," Proc. IEEE International Conference on Communications (ICC'2001), Helsinki, pp. 2048--2052, June 2001.
 
14
S.-T. Chuang, A. Goel, N. McKeown, B. Prabhakar, "Matching output queuing with a combined input and output queued switch," in Proc. IEEE INFOCOM, 1999.
 
15
M.G. Hluchyj, M.J. Karol, "Queuing in high-performance packet switching," IEEE J. Sel. Areas Commun. 6 (9), 1587--1597, 1988.
 
16
 
17
J-G Chen, et al. Chapter 14-Implementing High-Performance, High-Value Traffic Management Using Agere Network Processor Solutions, in Network Processor Design, volume 2, Morgan Kaufmann, 2004.
 
18
 
19
V. Frost and B. Melamed, "Traffic Modeling for Telecommunications Networks," IEEE Communications Magazine, 32(3), pp. 70--80, March, 1994.
20
 
21
Backbone packet header traces at OC192 and OC48, collected at Cooperative Association for Internet Data Analysis (CAIDA), http://www.caida.org/projects/trends/data/
 
22
Backbone and edge packet header traces collected from the Internet Measurement, Internet Analysis, National Laboratory for Applied Network Research (NLANR), http://moat.nlanr.net/


Collaborative Colleagues:
Sailesh Kumar: colleagues
John Maschmeyer: colleagues
Patrick Crowley: colleagues