Efficient vRouter Networking Stacks are Mandatory for uCPE TCO

Universal CPE (uCPE) is a hot topic in the industry because, through virtualization, it brings more flexibility to how end customers subscribe to and deploy value-added services.

It consists of a single platform running virtual network functions (VNFs) to replace multiple dedicated appliances. The hardware is built from commercial off-the-shelf (COTS) servers, which brings another strong advantage: a single, commonly used hardware platform can be sourced globally.

Many industry articles list the benefits of uCPE within the growing trend towards a flexible and programmable network. However, one of the main objections to uCPE from Service Providers is: “Will the uCPE meet my customers’ performance requirements?”

This question is valid because the uCPE must perform network-intensive tasks, including switching between VNFs at the host level. When VNFs are chained, even more switching operations are needed. Additionally, performance must be deterministic to avoid transient issues that are difficult to analyze and troubleshoot.

All-in-one uCPE appliances are very price sensitive, and a Telecom Equipment Manufacturer (TEM) starting in the uCPE business cannot afford to build its solution on a platform that won’t match Service Providers’ TCO expectations. I have personally been involved in many of these discussions with TEMs, and understand how critical it is to have the most efficient switching/routing stack, so as to keep the best possible price/performance ratio and stay competitive in this market.

Routing is a must-have feature for uCPE. To optimize the design, some TEMs decide to embed the mandatory vRouter directly in the infrastructure. This has multiple advantages: it avoids additional switching operations from/to a dedicated vRouter VNF, and it allows end-to-end Quality of Service (QoS) to be managed properly when congestion occurs along the VNF chain.

TEMs may also want to add a low-cost physical CPE design, without virtualization capabilities, based on the same architecture, routing software and management.

This blog post details a test that compares the efficiency of the Linux networking stack with 6WIND’s vRouter stack for uCPE solutions. The goal is to determine the routing capability of a single Atom C2000 CPU core in a uCPE use case, and so to understand the savings an efficient fast path stack brings over Linux. For the sake of simplicity, no VNFs are used in this test: the main goal is to showcase a low-cost CPE design based on a COTS server. Please note that 6WIND’s vRouter is fully able to combine switching features (by offloading the Linux Bridge or Open vSwitch data plane from the kernel) with routing. After the test, we will conclude that 6WIND vRouter delivers roughly 7X the performance of Linux for uCPE.

 

6WIND vRouter: Fast Path Networking Stack for uCPE

 

 

6WIND’s networking stack is designed as an acceleration engine for the Linux networking stack, offloading network processing from the Linux stack into what we refer to as our fast path. 6WIND’s fast path runs on a dedicated set of cores (only one in this example). It has very limited impact on the system’s management: standard Linux commands are used to configure networking, and we demonstrate below that our fast path properly reflects the Linux networking stack states. The advantages of such a design for uCPE are huge, and are summarized below.

 

One of the drawbacks of other projects such as OVS-DPDK or VPP is that they are designed as standalone stacks. This makes it very complex to mix Ethernet interfaces natively managed by those technologies with LTE or USB interfaces that may not be usable with those stacks. 6WIND’s vRouter natively supports this mix by design.

 

uCPE Benchmark: 6WIND vRouter vs Linux

The test was performed using an Intel Atom(TM) CPU C2758 running at 2.40GHz. A single core of this CPU is dedicated to the forwarding plane for testing in the fast path configuration (FP_MASK):

 

root@alicante:~# fp-conf-tool -D -S

: ${FP_MASK:=1}

: ${FP_PORTS:='0000:00:14.0 0000:00:14.1 0000:00:14.2 0000:00:14.3'}

 

Three 1Gbps ports are used as LAN ports, while a single 1Gbps port is used for WAN.

These 4x1Gbps ports are connected to an IXIA to generate the traffic and measure the performance.

 

1: Let’s start with the uCPE configuration

 

# Create LAN bridge

brctl addbr lan

brctl addif lan enp0s20f0

brctl addif lan enp0s20f1

brctl addif lan enp0s20f2

 

# Rename WAN port

ip link set dev enp0s20f3 name wan

 

# Change MACs

ip link set dev lan address 00:0C:C3:11:22:33

ip link set dev wan address 00:0C:C3:44:55:66

 

# Interfaces up

ip link set dev enp0s20f0 up

ip link set dev enp0s20f1 up

ip link set dev enp0s20f2 up

ip link set dev lan up

ip link set dev wan up

 

# IP address on LAN

ip address add 192.168.1.254/24 dev lan

 

# IP address on WAN

ip address add 172.16.1.254/24 dev wan

 

# Add NAT on WAN interface

iptables -t nat -A POSTROUTING -o wan -j MASQUERADE

 

# Add 3 fake hosts on LAN

ip neighbor add 192.168.1.1 lladdr 00:0c:c3:11:11:11 dev lan nud permanent

ip neighbor add 192.168.1.2 lladdr 00:0c:c3:22:22:22 dev lan nud permanent

ip neighbor add 192.168.1.3 lladdr 00:0c:c3:33:33:33 dev lan nud permanent

 

# Add fake neighbor on WAN

ip neighbor add 172.16.1.100 lladdr 00:0c:c3:44:44:44 dev wan nud permanent

 

# Create a strict priority scheduler on wan egress interface, and shape @ 1Gbps

fp-cli qos-sched-add wan prio 4 rate 1g

 

# Mark voice traffic as prio 1

iptables -t mangle -A POSTROUTING -m dscp --dscp 0x2e -j MARK --set-xmark 0x1

 

# Mark video traffic as prio 2

iptables -t mangle -A POSTROUTING -m dscp --dscp 0x0a -j MARK --set-xmark 0x2
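The two hexadecimal DSCP values in the marking rules are easier to read once decoded. The following is only a small sketch, with hypothetical helper names that are not part of iptables or of any 6WIND API: it derives the code points for EF and AF11 and mirrors the DSCP-to-mark mapping configured by the rules.

```python
# Illustrative helpers only; the names below are hypothetical.

EF_DSCP = 46  # Expedited Forwarding code point (RFC 3246)


def af_dscp(af_class: int, drop_precedence: int) -> int:
    """DSCP code point for Assured Forwarding AFxy (RFC 2597)."""
    if not (1 <= af_class <= 4 and 1 <= drop_precedence <= 3):
        raise ValueError("AF class must be 1..4 and drop precedence 1..3")
    return 8 * af_class + 2 * drop_precedence


def tos_byte(dscp: int) -> int:
    """DSCP sits in the upper six bits of the IPv4 TOS byte."""
    return dscp << 2


# DSCP value -> firewall mark, mirroring the two mangle rules.
DSCP_TO_MARK = {0x2E: 0x1, 0x0A: 0x2}


def mark_for(dscp: int) -> int:
    """Mark applied by the rules; unmatched packets keep mark 0."""
    return DSCP_TO_MARK.get(dscp, 0x0)


print(hex(EF_DSCP), hex(af_dscp(1, 1)))          # 0x2e 0xa
print(hex(tos_byte(EF_DSCP)))                    # 0xb8
print(mark_for(46), mark_for(10), mark_for(0))   # 1 2 0
```

So 0x2e is EF (voice) and 0x0a is AF11 (video here), while best-effort traffic with DSCP 0 is left unmarked.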

 

 

2: Let’s dump the fast path states to check everything is properly in sync

 

We first check the Linux Bridge has been created in the fast path, and the three LAN ports are added to this bridge:

 

root@alicante:~# fp-cli bridge

Bridge interfaces:

lan-vr0:

nf_call_iptables off:

nf_call_ip6tables off:

enp0s20f2-vr0: master lan-vr0

state: forwarding

features: learning flooding

enp0s20f1-vr0: master lan-vr0

state: forwarding

features: learning flooding

enp0s20f0-vr0: master lan-vr0

state: forwarding

features: learning flooding

 

 

We then validate that the IP addresses and routes are synchronized:

 

root@alicante:~# fp-cli addr4 lan

number of ip address: 1

192.168.1.254 [2]

 

root@alicante:~# fp-cli addr4 wan

number of ip address: 1

172.16.1.254 [1]

 

root@alicante:~# fp-cli route4  type all  table 254

# - Preferred, * - Active, > - selected

(254) 0.0.0.0/0 [03]  NEIGH gw 10.16.18.9 via mgmt0-vr0 (11)

(254) 10.16.18.0/24 [07]  CONNECTED via mgmt0-vr0 (12)

(254) 172.16.1.0/24 [40]  CONNECTED via wan-vr0 (43)

(254) 192.168.1.0/24 [38]  CONNECTED via lan-vr0 (41)
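To make the route dump concrete, here is a minimal longest-prefix-match sketch over the four routes shown above. It is a plain linear scan written for readability, an assumption for illustration only: it says nothing about how the fast path actually stores or searches its routing table.

```python
import ipaddress

# Routes copied from the fp-cli route4 dump: (prefix, kind, output interface).
ROUTES = [
    ("0.0.0.0/0", "NEIGH gw 10.16.18.9", "mgmt0-vr0"),
    ("10.16.18.0/24", "CONNECTED", "mgmt0-vr0"),
    ("172.16.1.0/24", "CONNECTED", "wan-vr0"),
    ("192.168.1.0/24", "CONNECTED", "lan-vr0"),
]


def lookup(dst: str):
    """Return (matching prefix, output interface) for a destination address."""
    addr = ipaddress.ip_address(dst)
    candidates = [
        (ipaddress.ip_network(prefix), iface)
        for prefix, _kind, iface in ROUTES
        if addr in ipaddress.ip_network(prefix)
    ]
    # The most specific (longest) matching prefix wins.
    net, iface = max(candidates, key=lambda entry: entry[0].prefixlen)
    return str(net), iface


print(lookup("172.16.1.100"))  # fake WAN neighbor
print(lookup("192.168.1.2"))   # fake LAN host
print(lookup("8.8.8.8"))       # anything else falls back to the default route
```

With this table, the test traffic towards 172.16.1.100 leaves through wan-vr0 and the return traffic towards the 192.168.1.0/24 hosts leaves through lan-vr0.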

 

 

Finally, on the filtering side, we have our NAT rule, and our two classification rules:

 

root@alicante:~# fp-cli nf4-rules nat

Chain PREROUTING (policy ACCEPT 0 packets 0 bytes)

pkts      bytes target    prot opt  in     out    source              destination

Chain INPUT (policy ACCEPT 0 packets 0 bytes)

pkts      bytes target    prot opt  in     out    source              destination

Chain OUTPUT (policy ACCEPT 0 packets 0 bytes)

pkts      bytes target    prot opt  in     out    source              destination

Chain POSTROUTING (policy ACCEPT 0 packets 0 bytes)

pkts      bytes target    prot opt  in     out    source              destination

0          0 MASQUERADE all  --   any    wan    anywhere            anywhere

 

root@alicante:~# fp-cli nf4-rules mangle

Chain PREROUTING (policy ACCEPT 0 packets 0 bytes)

pkts      bytes target    prot opt  in     out    source              destination

Chain INPUT (policy ACCEPT 0 packets 0 bytes)

pkts      bytes target    prot opt  in     out    source              destination

Chain FORWARD (policy ACCEPT 0 packets 0 bytes)

pkts      bytes target    prot opt  in     out    source              destination

Chain OUTPUT (policy ACCEPT 0 packets 0 bytes)

pkts      bytes target    prot opt  in     out    source              destination

Chain POSTROUTING (policy ACCEPT 568517349 packets 189865968130 bytes)

pkts      bytes target    prot opt  in     out    source              destination

0          0 MARK      all  --   any    any    anywhere            anywhere             MARK set 0x1 DSCP match 0x2e

0          0 MARK      all  --   any    any    anywhere            anywhere             MARK set 0x2 DSCP match 0x0a

 

 

3: Traffic generated from the IXIA

 

From the IXIA, I am sending IMIX traffic (64B x 7, 570B x 4, 1518B x 1) from all the interfaces:

 

Port 1 (LAN) -> Port 4 (WAN):

MAC: 00 0C C3 11 11 11 -> 00 0C C3 11 22 33

IPs: 192.168.1.1       -> 172.16.1.100

UDP: Port 10000        -> Port 10000

DSCP: EF (46)

Rate: 33%

 

Port 2 (LAN) -> Port 4 (WAN):

MAC: 00 0C C3 22 22 22 -> 00 0C C3 11 22 33

IPs: 192.168.1.2       -> 172.16.1.100

UDP: Port 20000        -> Port 20000

DSCP: AF Class 1 Low Drop Precedence (10)

Rate: 33%

 

Port 3 (LAN) -> Port 4 (WAN):

MAC: 00 0C C3 33 33 33 -> 00 0C C3 11 22 33

IPs: 192.168.1.3       -> 172.16.1.100

UDP: Port 30000        -> Port 30000

DSCP: 0

Rate: 38% (sending a little more than 1G from LAN to WAN so that traffic is dropped by QoS)

 

Port 4 (WAN) -> Port 1 (LAN):

MAC: 00 0C C3 44 44 44 -> 00 0C C3 44 55 66

IPs: 172.16.1.100      -> 172.16.1.254

UDP: Port 10000        -> Port 10000

DSCP: 0

Rate: 33%

 

Port 4 (WAN) -> Port 2 (LAN):

MAC: 00 0C C3 44 44 44 -> 00 0C C3 44 55 66

IPs: 172.16.1.100      -> 172.16.1.254

UDP: Port 20000        -> Port 20000

DSCP: 0

Rate: 33%

 

Port 4 (WAN) -> Port 3 (LAN):

MAC: 00 0C C3 44 44 44 -> 00 0C C3 44 55 66

IPs: 172.16.1.100      -> 172.16.1.254

UDP: Port 30000        -> Port 30000

DSCP: 0

Rate: 33%
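To get a feel for the packet rates behind these percentages, the IMIX mix can be turned into an average frame size and a frame rate. This is a rough sketch under stated assumptions: frame sizes include the FCS, each frame costs an extra 20 bytes on the wire (preamble, start-of-frame delimiter and inter-frame gap), and the rate percentages are relative to the 1Gbps line rate. The function names are hypothetical.

```python
# IMIX mix used in the test: (frame size in bytes, weight).
IMIX = [(64, 7), (570, 4), (1518, 1)]

# Assumed per-frame wire overhead: 7B preamble + 1B SFD + 12B inter-frame gap.
WIRE_OVERHEAD_BYTES = 20


def average_frame_size(mix):
    """Weighted average frame size in bytes."""
    total_frames = sum(weight for _size, weight in mix)
    total_bytes = sum(size * weight for size, weight in mix)
    return total_bytes / total_frames


def frames_per_second(link_bps, mix, load=1.0):
    """Frame rate for a given link speed and load (1.0 = line rate)."""
    bits_per_frame = (average_frame_size(mix) + WIRE_OVERHEAD_BYTES) * 8
    return link_bps * load / bits_per_frame


avg = average_frame_size(IMIX)
wan_to_lan = frames_per_second(1e9, IMIX, load=0.33 * 3)
lan_to_wan = frames_per_second(1e9, IMIX, load=0.33 + 0.33 + 0.38)
print(f"average frame: {avg:.1f} bytes")
print(f"WAN to LAN: {wan_to_lan:,.0f} frames/s, LAN to WAN offered: {lan_to_wan:,.0f} frames/s")
```

Under these assumptions the average frame is roughly 354 bytes, and a fully loaded 1Gbps port carries about 334,000 frames per second in each direction.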

 

 

4: Starting the traffic

 

I start traffic from LAN to WAN to establish the conntracks:

 

root@alicante:~# conntrack -L

udp      17 28 src=192.168.1.1 dst=172.16.1.100 sport=10000 dport=10000 [UNREPLIED] src=172.16.1.100 dst=172.16.1.254 sport=10000 dport=10000 mark=0 use=1

udp      17 27 src=192.168.1.3 dst=172.16.1.100 sport=30000 dport=30000 [UNREPLIED] src=172.16.1.100 dst=172.16.1.254 sport=30000 dport=30000 mark=0 use=1

udp      17 27 src=192.168.1.2 dst=172.16.1.100 sport=20000 dport=20000 [UNREPLIED] src=172.16.1.100 dst=172.16.1.254 sport=20000 dport=20000 mark=0 use=1

 

Those are synced in the fast path:

 

root@alicante:~# fp-cli nfct4

Number of flows: 8/1024

Flow: #2

Proto: 17

Original: src: 192.168.1.2:20000 -> dst: 172.16.1.100:20000

Reply:    src: 172.16.1.100:20000 -> dst: 172.16.1.254:20000

VRF-ID: 0       Zone: 0 Mark: 0x0

Flag: 0x45, hitflag: 0x01,

snat: yes, dnat: no,

assured: no, seen_reply: no,

unreplied: yes, expected: no,

update: yes, end: no

Stats:

Original: pkt: 24008990, bytes: 3762809704

Reply:    pkt: 0, bytes: 0

 

Flow: #3

Proto: 17

Original: src: 192.168.1.3:30000 -> dst: 172.16.1.100:30000

Reply:    src: 172.16.1.100:30000 -> dst: 172.16.1.254:30000

VRF-ID: 0       Zone: 0 Mark: 0x0

Flag: 0x45, hitflag: 0x01,

snat: yes, dnat: no,

assured: no, seen_reply: no,

unreplied: yes, expected: no,

update: yes, end: no

Stats:

Original: pkt: 27154256, bytes: 534004408

Reply:    pkt: 0, bytes: 0

 

..

Flow: #37

Proto: 17

Original: src: 192.168.1.1:10000 -> dst: 172.16.1.100:10000

Reply:    src: 172.16.1.100:10000 -> dst: 172.16.1.254:10000

VRF-ID: 0       Zone: 0 Mark: 0x0

Flag: 0x45, hitflag: 0x01,

snat: yes, dnat: no,

assured: no, seen_reply: no,

unreplied: yes, expected: no,

update: yes, end: no

Stats:

Original: pkt: 24542609, bytes: 3939691804

Reply:    pkt: 0, bytes: 0

 

 

I then send the return traffic from WAN to LAN at 1Gbps with IMIX packets.

Conntracks are now marked as assured, meaning the connection tracking table has seen the reply traffic:

 

root@alicante:~# conntrack -L

udp      17 169 src=192.168.1.1 dst=172.16.1.100 sport=10000 dport=10000 src=172.16.1.100 dst=172.16.1.254 sport=10000 dport=10000 [ASSURED] mark=0 use=1

udp      17 169 src=192.168.1.3 dst=172.16.1.100 sport=30000 dport=30000 src=172.16.1.100 dst=172.16.1.254 sport=30000 dport=30000 [ASSURED] mark=0 use=1

udp      17 169 src=192.168.1.2 dst=172.16.1.100 sport=20000 dport=20000 src=172.16.1.100 dst=172.16.1.254 sport=20000 dport=20000 [ASSURED] mark=0 use=1
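When there are more than a handful of flows, this check is easier to script than to eyeball. Below is a small sketch of a parser for one line of `conntrack -L` output, using the first entry above as sample input. The function name and the returned dictionary layout are my own choices for illustration.

```python
import re

# Sample input: the first entry of the conntrack -L output above.
LINE = (
    "udp      17 169 src=192.168.1.1 dst=172.16.1.100 sport=10000 dport=10000 "
    "src=172.16.1.100 dst=172.16.1.254 sport=10000 dport=10000 [ASSURED] mark=0 use=1"
)


def parse_conntrack(line: str) -> dict:
    """Split one conntrack entry into protocol, flags, and both tuples."""
    flags = re.findall(r"\[(\w+)\]", line)
    entry = {"proto": line.split()[0], "flags": flags, "original": {}, "reply": {}}
    for key, value in re.findall(r"(\w+)=(\S+)", line):
        if key in ("src", "dst", "sport", "dport"):
            # First occurrence is the original direction, second is the reply.
            side = "original" if key not in entry["original"] else "reply"
            entry[side][key] = value
        else:
            entry[key] = value
    return entry


flow = parse_conntrack(LINE)
print(flow["flags"])          # ['ASSURED']
print(flow["reply"]["dst"])   # 172.16.1.254
```

The reply tuple's destination is the WAN address 172.16.1.254 rather than the LAN host, which is the effect of the MASQUERADE rule on the original source address.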

 

And synced in the fast path:

 

root@alicante:~# fp-cli nfct4

Number of flows: 7/1024

Flow: #2

Proto: 17

Original: src: 192.168.1.2:20000 -> dst: 172.16.1.100:20000

Reply:    src: 172.16.1.100:20000 -> dst: 172.16.1.254:20000

VRF-ID: 0       Zone: 0 Mark: 0x0

Flag: 0x15, hitflag: 0x01,

snat: yes, dnat: no,

assured: yes, seen_reply: no,

unreplied: no, expected: no,

update: yes, end: no

Stats:

Original: pkt: 63093026, bytes: 4003412860

Reply:    pkt: 15317000, bytes: 794127526

 

Flow: #3

Proto: 17

Original: src: 192.168.1.3:30000 -> dst: 172.16.1.100:30000

Reply:    src: 172.16.1.100:30000 -> dst: 172.16.1.254:30000

VRF-ID: 0       Zone: 0 Mark: 0x0

Flag: 0x15, hitflag: 0x01,

snat: yes, dnat: no,

assured: yes, seen_reply: no,

unreplied: no, expected: no,

update: yes, end: no

Stats:

Original: pkt: 72220564, bytes: 2780797232

Reply:    pkt: 15316878, bytes: 791419450

 

Flow: #37

Proto: 17

Original: src: 192.168.1.1:10000 -> dst: 172.16.1.100:10000

Reply:    src: 172.16.1.100:10000 -> dst: 172.16.1.254:10000

VRF-ID: 0       Zone: 0 Mark: 0x0

Flag: 0x15, hitflag: 0x01,

snat: yes, dnat: no,

assured: yes, seen_reply: no,

unreplied: no, expected: no,

update: yes, end: no

Stats:

Original: pkt: 63626752, bytes: 4176151352

Reply:    pkt: 15317006, bytes: 787949596

 

 

QoS is enabled and doing its job (dropping some packets for class 4, as the total traffic from LAN to WAN is above 1Gbps):

 

root@alicante:~# fp-cli qos-stats-reset

root@alicante:~# fp-cli qos-stats non-zero

sched iface=wan-vrf0 [1]:

| enq_ok_pkts:374981

| enq_drop_qfull_pkts:19361

| xmit_ok_pkts:375000

| class classid=0x1 [9]:

| | enq_ok_pkts:125066

| | xmit_ok_pkts:125074

| class classid=0x2 [10]:

| | enq_ok_pkts:125075

| | xmit_ok_pkts:125076

| class classid=0x3 [11]:

| class classid=0x4 [12]:

| | enq_ok_pkts:124859

| | enq_drop_qfull_pkts:19361

| | xmit_ok_pkts:124877
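These counters can be compared with what an idealized strict priority scheduler would do with the offered load. The sketch below is a simple fluid model, assuming that a lower class id means a higher priority and that unmarked traffic lands in class 4 (as the statistics suggest). It is not a description of the fast path scheduler itself, and the function names are hypothetical.

```python
# Offered LAN to WAN load per class, in percent of the 1Gbps WAN line rate
# (from the stream definitions: 33% EF, 33% AF11, 38% best effort).
OFFERED = {1: 33.0, 2: 33.0, 4: 38.0}
SHAPER = 100.0  # scheduler shaped at 1Gbps


def strict_priority(offered, capacity):
    """Serve classes in ascending class id; each one takes what is left."""
    served = {}
    for classid in sorted(offered):
        served[classid] = min(offered[classid], capacity)
        capacity -= served[classid]
    return served


def drop_ratio(enqueued_ok, dropped):
    """Fraction of packets dropped at enqueue time."""
    return dropped / (enqueued_ok + dropped)


served = strict_priority(OFFERED, SHAPER)
expected = 1 - served[4] / OFFERED[4]
observed = drop_ratio(enqueued_ok=124859, dropped=19361)
print(f"class 4 expected drop: {expected:.1%}, observed: {observed:.1%}")
# class 4 expected drop: 10.5%, observed: 13.4%
```

Classes 1 and 2 are served in full in both the model and the counters, and only class 4 sees drops. The model and a short counter snapshot need not agree exactly on the drop percentage.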

 

 

5: Results

Using 6WIND’s vRouter networking stack, I was able to sustain 1Gbps of bidirectional IMIX traffic (2Gbps in aggregate) with all features enabled on a single core.

 

Data plane CPU usage shows 100%, but the core may still have some margin thanks to the efficiency of packet bulking:

root@alicante:~# fp-cpu-usage

Fast path CPU usage:

cpu: %busy     cycles   cycles/packet   cycles/ic pkt

1:  100%  480029592            2361               0

average cycles/packets received from NIC: 2361 (480029592/203297)
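The cycles-per-packet figure is simply the ratio of the two counters in parentheses. As a rough sanity check, the sketch below recomputes it and derives the packet rate that such a per-packet cost would imply on a fully busy core. The 2.40GHz value is the nominal clock quoted earlier, and treating the cost as constant is an assumption (packet bulking changes it with load), so the result is an order of magnitude, not a measurement.

```python
CPU_HZ = 2.40e9        # nominal clock of the Atom C2758 used in the test
CYCLES = 480_029_592   # cycles reported by fp-cpu-usage
PACKETS = 203_297      # packets received from the NIC over the same period

# Integer division matches the rounded-down figure shown in the output.
cycles_per_packet = CYCLES // PACKETS

# Packet rate implied by that cost if every cycle were spent on packets.
implied_pps = CPU_HZ / cycles_per_packet

print(cycles_per_packet)                          # 2361
print(f"{implied_pps / 1e6:.2f} Mpps (rough, assumes constant per-packet cost)")
```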

No packets were lost to CPU load during the test, as the hardware queues reported no loss:

root@alicante:~# ethtool -S wan | grep miss

rx_missed_errors: 0

rx_missed_packets: 0

 

6: Same test with Linux

 

It is interesting to see how fast 6WIND’s vRouter is compared to Linux.

I therefore repeated the same test with 6WIND’s fast path stopped, still using a single core for network processing. To do so, I pinned all network IRQs to the core that the fast path was previously using.

 

The device configuration is exactly the same, except for the QoS scheduler. The fast path does not support Linux TC synchronization; it has its own dedicated API for QoS. The equivalent Linux TC configuration looks like this:

 

# Create a strict priority (SP) scheduler on wan egress interface

tc qdisc add dev wan root handle 1: prio bands 4 priomap 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3

tc filter add dev wan parent 1: protocol all prio 1 handle 0x1 fw classid 1:1

tc filter add dev wan parent 1: protocol all prio 1 handle 0x2 fw classid 1:2

 

The result is very different: of the 2Gbps sent by the IXIA, Linux successfully processes only 300Mbps, while the 6WIND fast path was able to sustain the full 2Gbps.
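The ratio behind the headline figure is straightforward to check. The sketch below also estimates how many Linux cores would be needed to match the fast path, under the optimistic assumption (mine, not something measured in this test) that Linux forwarding scales linearly with the number of cores.

```python
import math

FAST_PATH_MBPS = 2000  # full offered load sustained on one core
LINUX_MBPS = 300       # forwarded by the Linux stack on one core

speedup = FAST_PATH_MBPS / LINUX_MBPS

# Assumes perfect linear scaling across cores, which is a best case.
linux_cores_needed = math.ceil(FAST_PATH_MBPS / LINUX_MBPS)

print(f"speedup: {speedup:.2f}x")                 # speedup: 6.67x
print(f"Linux cores needed: {linux_cores_needed}")  # Linux cores needed: 7
```

The measured ratio is about 6.7, which rounds to the 7X quoted in this post.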

 

Conclusion: 6WIND vRouter is 7X Linux Performance for uCPE

 

The first conclusion of our benchmark is that, without an efficient networking stack, the number of COTS server cores needed to sustain the target performance for uCPE appliances may be too high for the targeted TCO. As demonstrated in this benchmark, 6WIND’s vRouter networking stack allows customers to pick an entry-level 2-core CPU to build a physical CPE with good performance, while a 4-core CPU running a Linux stack wouldn’t even come close to the same performance.

 

When VNFs are added, the efficiency of the host routing stack becomes even more important, as additional switching operations from/to the VNFs are combined with routing. The CPU cycles saved by an efficient routing stack therefore leave more compute for the VNFs themselves.

If you are designing a uCPE solution and looking to increase performance while saving costs, contact us today to request an evaluation of 6WIND vRouter.

 


 

Nicolas Harnois is Pre-sales Manager at 6WIND.