Network Processor:
Architecture and Applications
Yan Luo
Yan_Luo@uml.edu
http://faculty.uml.edu/yluo/
| 12/18/05 | Yan Luo, CAR of UML | 1 |
Overview of Network Processors
Network Processor Architectures
Applications
Case Studies
Wireless Mesh Network
Content-Aware Switch
Conclusion
Packet Processing in the Future Internet
[Figure: the future Internet (more packets and complex packet processing) set against ASICs and general-purpose processors]
•High processing power
•Support wire speed
•Programmable
•Scalable
•Optimized for network applications
• …
What is a Network Processor?

Programmable processors optimized for network applications and protocol processing
High performance
Programmable & Flexible
Optimized for packet processing
Main players: AMCC, Intel, Hifn, EZchip, Agere
Semico Research Corp. Oct. 14, 2003
Commercial Network Processors
| Vendor | Product | Line speed | Features |
| AMCC | nP7510 | OC-192 / 10 Gbps | Multi-core, customized ISA, multi-tasking |
| Intel | IXP2850 | OC-192 / 10 Gbps | Multi-core, h/w multi-threaded, coprocessor, h/w accelerators |
| Hifn | 5NP4G | OC-48 / 2.5 Gbps | Multi-threaded multiprocessor complex, h/w accelerators |
| EZchip | NP-2 | OC-192 / 10 Gbps | Classification engines, traffic managers |
| Agere | PayloadPlus | OC-192 / 10 Gbps | Multi-threaded, on-chip traffic management |
Typical Network Processor Architecture
[Figure: a network processor containing PEs, a co-processor and a h/w accelerator on a bus, with network interfaces, SDRAM (e.g. packet buffer) and SRAM (e.g. routing table)]
Intel IXP2400 Network Processor
Snapshots of IXP2xxx-Based Systems
ADI Roadrunner Platform
•IPv4 Forwarding/NAT
•Forwarding w/ QoS / DiffServ
•ATM RAN
•IP RAN
•IPv6/v4 dual stack forwarding
Radisys ENP2611 PCI Packet Processing Engine
•Multiservice switches
•Routers, broadband access devices
•Intrusion detection and prevention (IDS/IPS)
•Voice over IP (VoIP) gateway
•Virtual Private Network gateway
•Content-aware switch
Intel IXP425 Network Processor
StarEast: IXP425-Based Multi-radio Platform
Applications of Network Processors
Core router
DSL modem
Edge router
Wireless router
VoIP terminal
VPN gateway
Printer server
Case Study 1:
Wireless Mesh Network
Software Stack on StarEast
Case Study 2: Content-aware Switch
[Figure: a request from the Internet for www.yahoo.com reaches the switch as a packet (IP | TCP | APP. DATA) carrying "GET /cgi-bin/form HTTP/1.1", "Host: www.yahoo.com…"; behind the switch sit a media server, an application server and an HTML server]
Front-end of a Web cluster, only one Virtual IP
Route packets based on Layer 5 information
Examine application data in addition to IP & TCP headers
Advantages over layer 4 switches
Better load balancing: distributed based on content type
Faster response: exploit cache affinity
Better resource utilization: partition database
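As a minimal sketch of routing on Layer 5 information, the Python below parses the HTTP request line out of the application data and picks a back-end by content type. The server names, the extension list and the helper names (`parse_request_path`, `select_server`) are illustrative assumptions, not the switch's actual policy.

```python
# Hypothetical sketch: choose a back-end server from the HTTP request path.
# Server names and the extension list are assumptions for illustration only.

MEDIA_EXTENSIONS = (".mpg", ".mp3", ".avi", ".jpg", ".gif")


def parse_request_path(app_data: bytes) -> str:
    """Return the URL path from the first line of an HTTP request."""
    request_line = app_data.split(b"\r\n", 1)[0].decode("ascii", errors="replace")
    parts = request_line.split(" ")
    if len(parts) != 3:
        raise ValueError("malformed HTTP request line")
    return parts[1]


def select_server(app_data: bytes) -> str:
    """Map the requested content type to one of three back-end servers."""
    path = parse_request_path(app_data).lower()
    if path.startswith("/cgi-bin/"):
        return "application-server"   # dynamic content
    if path.endswith(MEDIA_EXTENSIONS):
        return "media-server"         # large static media
    return "html-server"              # everything else


request = b"GET /cgi-bin/form HTTP/1.1\r\nHost: www.yahoo.com\r\n\r\n"
print(select_server(request))
```

A layer 4 switch sees only addresses and ports, so it could not make this distinction; the decision here depends on bytes that arrive after the TCP handshake.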
Mechanisms to Build a Content-aware Switch
TCP gateway
An application level proxy
Sets up the 1st connection w/ client, parses the request, sets up the 2nd connection w/ server
Copy overhead
TCP splicing
Reduce the copy overhead
Forward packet at network level between the network interface driver and the TCP/IP stack
Two connections are spliced together
Modify fields in IP and TCP headers
Anatomy of TCP Splicing
Bookkeeping of connection states, selection of servers, state migration
SEQ # translation
Checksum recalculation
Etc.
[Figure panels: Without TCP Splicing | With TCP Splicing]
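The sequence number translation and checksum steps can be sketched as follows. This is a sketch under stated assumptions: the offset is taken to be the difference between the initial sequence number the switch sent to the client and the one the server chose, and the checksum is patched with the incremental update of RFC 1624 rather than recomputed over the whole segment. The slides do not say which method the splicer uses, and all numeric values are made up.

```python
# Hypothetical sketch of two per-packet splicing steps.
MASK32 = 0xFFFFFFFF


def translate(value: int, offset: int) -> int:
    """Shift a 32-bit sequence or ACK number by offset, wrapping modulo 2**32."""
    return (value + offset) & MASK32


def ones_complement_sum(words) -> int:
    """16-bit one's complement sum with end-around carry."""
    total = 0
    for word in words:
        total += word
        total = (total & 0xFFFF) + (total >> 16)
    return total


def checksum(words) -> int:
    """Internet checksum over a sequence of 16-bit words."""
    return ~ones_complement_sum(words) & 0xFFFF


def update_checksum32(old_csum: int, old_value: int, new_value: int) -> int:
    """RFC 1624 eqn. 3, HC' = ~(~HC + ~m + m'), applied to both 16-bit halves."""
    words = [~old_csum & 0xFFFF]
    for shift in (16, 0):
        words.append(~(old_value >> shift) & 0xFFFF)
        words.append((new_value >> shift) & 0xFFFF)
    return ~ones_complement_sum(words) & 0xFFFF


isn_switch = 1000                 # ISN the switch sent to the client (made up)
isn_server = 5000                 # ISN chosen by the server (made up)
offset = isn_switch - isn_server  # applied to SEQ numbers from the server

seq_from_server = 5001
seq_to_client = translate(seq_from_server, offset)
print(seq_to_client)
```

Packets in the opposite direction would have their ACK numbers shifted by `-offset`. The incremental form touches only the changed field, which is why it suits a per-packet fast path.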
Design Options
•Option 0: GP-based (Linux-based) switch
•Option 1: CP sets up and splices connections, DPs process packets sent after splicing
Connection setup & splicing is more complex than data forwarding
Packets before splicing need to be passed through DRAM queues
•Option 2: DPs handle connection setup, splicing & forwarding
IXP 2400 Block Diagram
[Block diagram: XScale core, eight MEs, SRAM controller, SDRAM controller, Scratch, Hash, CSR, IX bus interface, PCI]
XScale core
Microengines (MEs)
2 clusters of 4 microengines each
Each ME
Runs up to 8 threads
16KB instruction store
Local memory
Scratchpad memory, SRAM & DRAM controllers
Resource Allocation
SRAM (8MB)
•Client side CB list
•Server side CB list
•server selection table
•Locks
Client-side control block list
record states for connections between clients and SpliceNP, states after splicing
Server-side control block list
record states for connections between server and SpliceNP
DRAM (256MB)
•Packet buffer
Scratchpad (16KB)
•Packet queues
Microengines
•Rx ME
•Client ME
•Server ME
•Tx ME
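One way to picture the two control block lists is as tables keyed by the connection 4-tuple. The sketch below is an assumption-laden illustration: the field names (`state`, `seq_offset`, `server`) and the dictionary-based lookup are hypothetical, and the slides say only that the lists live in SRAM and record connection states.

```python
# Hypothetical sketch of the client-side and server-side control block lists.
from dataclasses import dataclass
from typing import Dict, Optional, Tuple

# (source IP, source port, destination IP, destination port)
FourTuple = Tuple[str, int, str, int]


@dataclass
class ControlBlock:
    state: str = "LISTEN"          # connection state, e.g. SYN_RCVD
    seq_offset: int = 0            # used after splicing (assumed field)
    server: Optional[str] = None   # filled in once a server is selected


class ControlBlockTable:
    """Stand-in for one control block list; a real one would sit in SRAM."""

    def __init__(self) -> None:
        self._blocks: Dict[FourTuple, ControlBlock] = {}

    def lookup(self, key: FourTuple) -> Optional[ControlBlock]:
        return self._blocks.get(key)

    def create(self, key: FourTuple) -> ControlBlock:
        block = ControlBlock()
        self._blocks[key] = block
        return block

    def remove(self, key: FourTuple) -> None:
        self._blocks.pop(key, None)


client_side = ControlBlockTable()   # connections between clients and the switch
server_side = ControlBlockTable()   # connections between the switch and servers

key = ("10.0.0.7", 40000, "10.0.0.1", 80)   # made-up addresses
block = client_side.create(key)
block.state = "SYN_RCVD"
```

Because several microengine threads would touch these tables at once, the real design also needs the locks listed under SRAM; this single-threaded sketch leaves them out.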
Comparison of Functionality
•SpliceNP implements a lite version of TCP because of the limited instruction store of the microengines.
Processing a SYN packet
| Step | Functionality | TCP | Linux Splicer | SpliceNP |
| 1 | Dequeue packet | Y | Y | Y |
| 2 | IP header verification | Y | Y | Y |
| 3 | IP option processing | Y | Y | N |
| 4 | TCP header verification | Y | Y | Y |
| 5 | Control block lookup | Y | Y | Y |
| 6 | Create new socket and set state to LISTEN | Y | Y | No socket, only control block |
| 7 | Initialize TCP and IP header template | Y | Y | N |
| 8 | Reset idle time and keep-alive timer | Y | Y | N |
| 9 | Process TCP option | Y | Y | Only MSS option |
| 10 | Send ACK packet, change state to SYN_RCVD | Y | Y | Y |
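Step 9 shows how the lite TCP saves instructions: only the MSS option is interpreted and every other option is skipped. A minimal sketch of such a parser follows; the function name `parse_mss` is hypothetical, while the option encoding (kind 0 ends the list, kind 1 is a one-byte no-op, kind 2 with length 4 carries the MSS) is the standard TCP layout.

```python
# Hypothetical sketch: walk the TCP options and extract only the MSS.
from typing import Optional


def parse_mss(options: bytes) -> Optional[int]:
    """Return the MSS value if present, ignoring all other options."""
    i = 0
    while i < len(options):
        kind = options[i]
        if kind == 0:                 # end of option list
            break
        if kind == 1:                 # no-operation, a single byte
            i += 1
            continue
        if i + 1 >= len(options):     # truncated option, stop parsing
            break
        length = options[i + 1]
        if length < 2 or i + length > len(options):
            break                     # malformed length, stop parsing
        if kind == 2 and length == 4:
            return int.from_bytes(options[i + 2:i + 4], "big")
        i += length                   # skip any option other than MSS
    return None


# MSS = 0x05B4, then a NOP, then a window scale option that is skipped.
syn_options = bytes([2, 4, 0x05, 0xB4, 1, 3, 3, 7])
print(parse_mss(syn_options))
```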
Experimental Setup 


Radisys ENP2611 containing an IXP2400
XScale & ME: 600MHz
8MB SRAM and 128MB DRAM
Three 1Gbps Ethernet ports: 1 client port and 2 server ports
Server: Apache web server on an Intel 3.0GHz Xeon processor
Client: Httperf on a 2.5GHz Intel P4 processor
Linux-based switch
Loadable kernel module
2.5GHz P4, two 1Gbps Ethernet NICs
Latency on a Linux-based TCP Splicer
Latency is reduced by TCP splicing
Latency vs Request File Size
[Figure: latency on the splicer (ms, 0 to 20) vs. request file size (1 KB to 1024 KB), Linux-based and NP-based]
Latency reduced significantly
83.3% (0.6ms → 0.1ms) @ 1KB
The larger the file size, the higher the reduction
89.5% @ 1MB file
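The 1KB figure follows directly from the two latencies quoted for that file size, 0.6 ms and 0.1 ms. A small sketch of the arithmetic, with `percent_reduction` as a hypothetical helper name:

```python
# Sketch of the arithmetic behind the 1KB latency reduction.

def percent_reduction(before: float, after: float) -> float:
    """Relative reduction from before to after, as a percentage."""
    return (before - after) / before * 100.0


print(f"{percent_reduction(0.6, 0.1):.1f}%")
```

The slides give no absolute latencies for the 1MB case, so the same check cannot be repeated for the 89.5% figure.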
Comparison of Packet Processing Latency
Analysis of Latency Reduction
| Linux-based | NP-based |
| Interrupt: NIC raises an interrupt once a packet comes | Polling |
| NIC-to-mem copy: on a Xeon 3.0GHz dual processor w/ 1Gbps Intel Pro 1000 (88544GC) NIC, 3 us to copy a 64-byte packet by DMA | No copy: packets are processed inside the NP without two copies |
| Linux processing: OS overheads; processing a data packet in splicing state takes 13.6 us | IXP processing: optimized ISA; 6.5 us |
Throughput vs Request File Size
[Figure: throughput (Mbps, 0 to 800) vs. request file size (1 KB to 1024 KB), Linux-based and NP-based]
Throughput is increased significantly
5.7x for small file size @ 1KB, 2.2x for large file @ 1MB
Higher improvement for small files
Latency reduction is larger for control packets than for data packets
Control packets take a larger portion for small files
Conclusion
Network processors combine high-performance packet processing and programmability
A large variety of NP applications
Efficient resource utilization is challenging
Thank you!
Microengine