How should Amdahl’s law drive the redesigns of socket system calls for an OS on a multicore CPU?


The previous post asked which BSD socket is the fastest – TCP, UDP or SCTP. I think TCP is the fastest protocol.

This might seem surprising, because a TCP stack is far more complex than a UDP stack. But if we analyze the processing load for each payload delivered by recv() (the same applies to recvmsg(), recvfrom(), readv() or read()), most of that load comes from the system call that transfers the payload to userland.

The following protocols and sockets have the same design bottlenecks:

  • TIPC sockets, with SOCK_DGRAM, SOCK_RDM, SOCK_SEQPACKET sockets
  • SCTP sockets with SOCK_SEQPACKET
  • ATM sockets which are SOCK_DGRAM-like
  • RAW IP sockets which are SOCK_RAW
  • On Linux, Netlink sockets (PF_NETLINK) which are SOCK_DGRAM sockets
  • On BSD based kernel stacks, the routing sockets (PF_ROUTE) which are SOCK_RAW sockets
  • etc.: any other non-SOCK_STREAM socket

TCP is more efficient than UDP because it is a stream protocol (SOCK_STREAM), while UDP is a datagram protocol and socket type (SOCK_DGRAM). So, should we give up using UDP?

No, we shouldn’t. However, let’s analyze why a stream socket is more efficient by comparing a transfer of 900,000 bytes of Layer 5 payload, made up of 900 Layer 5 datagrams of 1,000 bytes each, for the following receive and send calls:

uint8_t buf[15000];

/* Each read() or write() is one kernel crossing, whatever the number of bytes moved: */
read(s, buf, sizeof(buf));

write(s, buf, sizeof(buf));

Both UDP and TCP need to copy 900,000 bytes in order to communicate with userland:

  • if it’s a UDP socket, UDP payloads will transfer the 900,000 bytes using 900 system calls of 1,000 bytes each
  • if it’s a TCP socket, TCP payloads will transfer the 900,000 bytes using 60 system calls of 15,000 bytes each
  • both have the same memory copy constraints

Assuming that the number of cycles spent in a system call is much higher than in a memory copy or in the stack processing, then, through an approximation of Amdahl's law, we can estimate that TCP will be roughly 15 times faster than UDP (900 system calls versus 60).

Asymptotically, this approximation is even closer to reality for a multicore CPU because the kernel networking stack processing can be run on a different core from the core which is executing the read() or the write() system calls. So on multicore CPUs, TCP is even faster than UDP.

The bottom line is simple: TCP is faster because the cost of the system calls is amortized over many payloads.

Clearly, therefore, new system calls are required:

  • either we group the system calls in order to amortize their cost,
  • or we redesign the call flows between userland and the kernel in order to avoid the overhead of system calls.

Some attempts at the second approach have been initiated using the VDSO (Virtual Dynamic Shared Object; see linux-gate.so.1 in the output of ldd /bin/ls), but very few CPU BSPs support the VDSO and many system calls are still missing.
The first option is easier to apply without breaking current OS designs. For instance, recvmmsg() and sendmmsg() may represent the start of this new trend but, as far as I know, they are not defined by any POSIX standard.

There are some other factors preventing sockets from scaling on a multicore/multithread CPU. For example, the system does not provide proper load balancing (like a UDP socket dispatch or TCP accept() dispatch) to multiple daemons which are bind()-ed to the same address and port. This will be the topic of another post.

More information about 6WINDGate architecture can be found here.

You can download more detailed documents here.

You can check 6WINDGate FAQ here.
