Bulktransfer via TCP and UDP - A Rabbit/Turtle Race?

Hey Folks,

this small post is about an experiment I started to check whether a hypothesis I had was correct: “Sending large amounts of data via TCP is many times faster than via UDP, because fewer syscalls are involved”. I came to this hypothesis after some network performance tests with iperf3 that had the following outcome: with TCP, iperf3 easily reaches 10Gbit/s with a single connection. With UDP, even with all performance tuning options I could find, no more than 5Gbit/s were possible. (TODO: Sending was fast, but the NICs didn’t send the packets properly; issue: segmentation offloading.) My first thought was: well, you often read that TCP in the Linux kernel has been tuned over the years, while UDP was left behind in some cases. But is that all? What could lead to such different levels of performance? Note: I only had 10Gbit/s of network speed available, so TCP could potentially have been much faster.

To dig into this, I decided to start over with a completely separate approach. I wrote a small tool (TODO: Link) that transfers a file from disk over the network. It’s quite compact and can be started either in server or in client mode. By implementing the actual transfer with both TCP and UDP sockets, I created a fair testing environment for the two protocols. Of course, transferring the file via UDP without any reliability does not lead to an intact file on the receiver side, but that’s fine: the focus is on the sending side, where I’m interested in the difference in sending speed between the two protocols. But before looking at the results, let’s have a short look at the tool.

As mentioned in my introductory article (TODO: Link), I love to code in Go. Therefore, here we have the first example based on Go. The file transfer does not do much magic; there are only a few steps that client and server perform: the server listens on the configured port and waits for an incoming connection, and the client dials the configured address where the server listens. Then the client opens the file to transfer and copies its content into the connection using io.Copy (see the sketch below). I started the tool with TCP enabled on two machines connected with a direct 10Gbit/s link. Worked seamlessly… 10Gbit/s reliable file transfer in 5 minutes, pretty awesome. Now let’s head to UDP, which I expected to perform not as well as TCP due to previous experiences. But wait: UDP also sends at 10Gbit/s without any problems. Looking at the receiver side, however, there are no incoming packets for UDP, whereas with TCP the full 10Gbit/s arrive. Wow, that’s absolutely not what I expected, so let’s dig into it.
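To make this concrete, here is roughly what both modes boil down to. This is a minimal sketch, not the actual tool: the function names are mine, and the real tool adds flag parsing and configuration.

package main

import (
	"io"
	"net"
	"os"
)

// runServer accepts one connection and drains everything the client sends.
// Note: net.Listen only supports stream protocols, so a UDP receiver would
// use net.ListenPacket and ReadFrom instead.
func runServer(addr string) error {
	ln, err := net.Listen("tcp", addr)
	if err != nil {
		return err
	}
	defer ln.Close()

	conn, err := ln.Accept()
	if err != nil {
		return err
	}
	defer conn.Close()

	// Receive and discard; we only care about throughput.
	_, err = io.Copy(io.Discard, conn)
	return err
}

// runClient dials the server (network is "tcp" or "udp") and streams
// the file into the connection.
func runClient(network, addr, path string) error {
	conn, err := net.Dial(network, addr)
	if err != nil {
		return err
	}
	defer conn.Close()

	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()

	// The interesting part: hand everything over to io.Copy.
	_, err = io.Copy(conn, f)
	return err
}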

I decided to start with my application instead of looking into the kernel, because it’s easier to first rule out the tool as a possible cause of these results. Since I haven’t written any socket Write/Read calls myself, the first thing to look at is io.Copy. The method’s comment says:

Copy copies from src to dst until either EOF is reached on src or an error occurs. It returns the number of bytes copied and the first error encountered while copying, if any.

io.Copy uses copyBuffer internally. Looking into the code, it’s quite simple and understandable: it checks whether one of the magic methods WriteTo or ReadFrom is available on the reader or writer, respectively, and uses it if so. Otherwise, a buffer is created and, in a loop, filled from src and written to dst.

// copyBuffer is the actual implementation of Copy and CopyBuffer.
// if buf is nil, one is allocated.
func copyBuffer(dst Writer, src Reader, buf []byte) (written int64, err error) {
	// If the reader has a WriteTo method, use it to do the copy.
	// Avoids an allocation and a copy.
	if wt, ok := src.(WriterTo); ok {
		return wt.WriteTo(dst)
	}
	// Similarly, if the writer has a ReadFrom method, use it to do the copy.
	if rt, ok := dst.(ReaderFrom); ok {
		return rt.ReadFrom(src)
	}
	if buf == nil {
		size := 32 * 1024
		if l, ok := src.(*LimitedReader); ok && int64(size) > l.N {
			if l.N < 1 {
				size = 1
			} else {
				size = int(l.N)
			}
		}
		buf = make([]byte, size)
	}
	for {
		nr, er := src.Read(buf)
		if nr > 0 {
			nw, ew := dst.Write(buf[0:nr])
			if nw < 0 || nr < nw {
				nw = 0
				if ew == nil {
					ew = errInvalidWrite
				}
			}
			written += int64(nw)
			if ew != nil {
				err = ew
				break
			}
			if nr != nw {
				err = ErrShortWrite
				break
			}
		}
		if er != nil {
			if er != EOF {
				err = er
			}
			break
		}
	}
	return written, err
}

Well, we have two interesting things to analyze: the methods ReadFrom and WriteTo, and the buffer size, which seems to be significantly larger than our MTU. Let’s start with the first one. I set breakpoints in the WriteTo and ReadFrom calls and started the tool again, once with UDP and once with TCP. With UDP, neither of the breakpoints was triggered, so the copying via the buffer applies here. With TCP… gotcha! It’s the ReadFrom call that is used to transfer the file. Now let’s dig into this.

The TCP socket implements the ReaderFrom interface of io:

// ReaderFrom is the interface that wraps the ReadFrom method.
//
// ReadFrom reads data from r until EOF or error.
// The return value n is the number of bytes read.
// Any error except EOF encountered during the read is also returned.
//
// The Copy function uses ReaderFrom if available.
type ReaderFrom interface {
	ReadFrom(r Reader) (n int64, err error)
}

Okay, but why does this work while the copy-buffer code for UDP doesn’t? Looking into how TCPConn implements ReadFrom, the answer is quite simple: splice and sendfile, as shown in the implementation of readFrom.

func (c *TCPConn) readFrom(r io.Reader) (int64, error) {
	if n, err, handled := splice(c.fd, r); handled {
		return n, err
	}
	if n, err, handled := sendFile(c.fd, r); handled {
		return n, err
	}
	return genericReadFrom(c, r)
}

Both functions copy data directly into a particular file descriptor, c.fd in this example. sendfile requires the source to be a file descriptor that supports mmap-like operations and the target to be a socket. Well, that’s the case here, so we should be fine. splice works a bit differently: it requires at least one of the incoming or outgoing descriptors to be a pipe. Instead of diving into the details of both approaches, let’s again check with some breakpoints which of these calls is used in our case.
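To make the “copy within the kernel” idea concrete, here is a sketch of what a direct sendfile(2) call looks like from Go. This is Linux-only and purely illustrative: the net package arranges all of this for us internally, and sendFileDirect is a name I made up.

package main

import (
	"net"
	"os"
	"syscall"
)

// sendFileDirect copies a whole file into a TCP connection using the
// sendfile(2) syscall, so the payload never enters user space.
func sendFileDirect(conn *net.TCPConn, path string) error {
	f, err := os.Open(path)
	if err != nil {
		return err
	}
	defer f.Close()

	fi, err := f.Stat()
	if err != nil {
		return err
	}
	remaining := fi.Size()

	raw, err := conn.SyscallConn()
	if err != nil {
		return err
	}

	var offset int64
	var sendErr error
	err = raw.Write(func(outfd uintptr) bool {
		for remaining > 0 {
			n, serr := syscall.Sendfile(int(outfd), int(f.Fd()), &offset, int(remaining))
			if n > 0 {
				remaining -= int64(n)
			}
			if serr == syscall.EAGAIN {
				// Socket buffer is full: return false so the runtime
				// poller waits until the fd is writable and calls us again.
				return false
			}
			if serr != nil {
				sendErr = serr
				return true
			}
		}
		return true
	})
	if err != nil {
		return err
	}
	return sendErr
}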

And the winner is: sendfile. The man page says about sendfile: “sendfile() copies data between one file descriptor and another. Because this copying is done within the kernel, sendfile() is more efficient than the combination of read(2) and write(2), which would require transferring data to and from user space.” Great, now we at least know that the comparison of TCP and UDP is unfair with this trick used by TCP. So what happens if we bypass the ReadFrom shortcut and let TCP also use the manual copying via the buffer (see the sketch below)? Okay, TCP again rocks 10Gbit/s on both sides. So why doesn’t UDP transfer any packets to the remote machine?
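By the way, bypassing ReadFrom doesn’t require patching the standard library: it’s enough to hide the methods behind thin wrapper types so io.Copy’s type assertions fail. A sketch, with writeOnly and readOnly being names I made up:

package main

import (
	"io"
	"net"
	"os"
)

// writeOnly exposes nothing but Write, so the dst.(ReaderFrom) assertion
// in copyBuffer fails and io.Copy falls back to the generic loop.
type writeOnly struct{ w io.Writer }

func (w writeOnly) Write(p []byte) (int, error) { return w.w.Write(p) }

// readOnly exposes nothing but Read; on newer Go versions *os.File
// implements io.WriterTo as well, so the source needs hiding too.
type readOnly struct{ r io.Reader }

func (r readOnly) Read(p []byte) (int, error) { return r.r.Read(p) }

// sendWithoutSendfile forces io.Copy onto the buffered read/write path.
func sendWithoutSendfile(conn net.Conn, f *os.File) (int64, error) {
	return io.Copy(writeOnly{conn}, readOnly{f})
}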

Another look at the copyBuffer method hints at another interesting thing: the actual buffer size, which is 32K and thus far above our MTU of 1500 bytes. And now one of the big differences between UDP and TCP kicks in: UDP tries to put the whole buffer into one packet by default, whereas TCP, since it’s stream-based, splits the buffer into correctly sized packets. So we can fix this for UDP by decreasing the buffer size to something that fits into the MTU, let’s say 1400 bytes. Running the tool again with this buffer size leads to interesting results:

TCP        UDP
4.5Gbit/s  3.2Gbit/s

So even TCP is far from 10Gbit/s now. One interesting thing we can observe from this table: TCP sockets are indeed a bit faster than UDP sockets, but the gap is nowhere near what my hypothesis predicted. Without further investigation (which may follow later), I assume the number of read/write syscalls (and consequently the number of context switches) limits the performance of both sockets in this case. A rough back-of-the-envelope estimate: at 1400 bytes per write, 4.5Gbit/s works out to 4.5e9 / (1400 * 8) ≈ 400,000 write syscalls per second, plus the matching reads from the file.
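For completeness, the MTU-sized copying from above boils down to a loop like this. A minimal sketch with a made-up name: I use an explicit loop rather than io.CopyBuffer, because io.CopyBuffer also takes the WriteTo/ReadFrom shortcuts where available and would ignore the buffer.

package main

import "io"

// copyMTUChunks copies src to dst in 1400-byte writes, so every Write on a
// UDP socket produces a single datagram that fits into a 1500-byte MTU
// (leaving room for the IP and UDP headers).
func copyMTUChunks(dst io.Writer, src io.Reader) (int64, error) {
	buf := make([]byte, 1400)
	var written int64
	for {
		n, rerr := src.Read(buf)
		if n > 0 {
			wn, werr := dst.Write(buf[:n]) // one Write == one datagram on UDP
			written += int64(wn)
			if werr != nil {
				return written, werr
			}
		}
		if rerr == io.EOF {
			return written, nil
		}
		if rerr != nil {
			return written, rerr
		}
	}
}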

In summary, I can say that this comparison is not really a rabbit/turtle race once we create fair conditions (removing the sendfile hack). For me, the actual outcome was surprising, and I learned: if you’re stuck in your work, start again from scratch with a different setup that tackles the same problem, and you may make some really interesting observations.

Final note: also for UDP, there are methods to move larger buffers to the kernel and let the kernel (or even the NIC) do the packetizing. I will get to them in upcoming articles.

As always, please mail me any feedback you have. I really appreciate any kind of comments or additional information, and I will update this article with any helpful input I get.

Cheers, Marten