Thursday, December 1, 2011

Performance lessons for HTTP sockets

hellepoll is a fast little webserver that I’ve made.  I’ve written a few little custom event loops like this over the years; this isn’t the first, and likely not the last.

In the tightest HTTP handling code you have to handle pipelining and flushing.  This is the hellepoll secret:

Normally, HTTP connections are kept alive and reused for many requests.  The only non-keep-alive connections you might see are if you are behind a particularly basic proxy (nginx, stop blushing; people seem to forgive you) or if your client is actually a bit of app code running some dead-simple HTTP client rather than a sophisticated browser.

HTTP clients can pipeline requests - that is, they can issue a request to fetch a page before previously-requested pages have been served.  The server has to reply in order, but it can prepare later responses while earlier requests are still being served, rather than only starting to process a request once the previous ones have gone out.

[Figure: HTTP pipelining, from Wikipedia]
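On the wire, pipelining just means the client sends several requests back-to-back without waiting for the responses, so something like this (the paths and host are only illustrative) can land in the server’s read buffer in one go:

    GET /index.html HTTP/1.1
    Host: example.com

    GET /style.css HTTP/1.1
    Host: example.com

    GET /logo.png HTTP/1.1
    Host: example.com

and the server must send the three responses back in exactly that order.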

If you are using a blocking IO model, you really do have to deal with each request in sequence; attempting to parse ahead pipelined requests and juggle them means writing a mini asynchronous IO loop of your own, at which point you might as well bite the bullet and be asynchronous from the beginning.  So asynchronous IO has a definite advantage in code organisation when you are trying to maximise the performance of an HTTP stream.
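To make the shape of that concrete, here is a minimal sketch - not hellepoll’s actual code, just the general pattern - of an epoll-driven loop that accepts connections and reads whatever is available, handing the bytes on to request parsing:

    // Sketch only: error handling, per-connection state and the HTTP parser are elided.
    #include <cerrno>
    #include <sys/epoll.h>
    #include <sys/socket.h>
    #include <unistd.h>

    void event_loop(int listen_fd) {
        int ep = epoll_create1(0);
        epoll_event ev = {};
        ev.events = EPOLLIN;
        ev.data.fd = listen_fd;
        epoll_ctl(ep, EPOLL_CTL_ADD, listen_fd, &ev);

        epoll_event events[64];
        char buf[16 * 1024];                       // would be per-connection in real code
        for (;;) {
            int ready = epoll_wait(ep, events, 64, -1);
            for (int i = 0; i < ready; ++i) {
                int fd = events[i].data.fd;
                if (fd == listen_fd) {
                    // new connection; make it non-blocking and watch it for reads
                    int conn = accept4(listen_fd, nullptr, nullptr, SOCK_NONBLOCK);
                    ev.events = EPOLLIN;
                    ev.data.fd = conn;
                    epoll_ctl(ep, EPOLL_CTL_ADD, conn, &ev);
                } else if (events[i].events & EPOLLIN) {
                    ssize_t got = read(fd, buf, sizeof buf);
                    if (got == 0 || (got < 0 && errno != EAGAIN)) { close(fd); continue; }
                    // parse as many complete (possibly pipelined) requests out of
                    // the buffer as are present, and queue their responses
                }
            }
        }
    }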

HTTP 1.1 clients must support chunked transfer encoding; HTTP 1.0 clients may not.  As chunked transfer encoding simplifies replying to pipelined requests, I find it simpler to force connection-close on HTTP 1.0 keep-alive requests that do not support it.  If you are writing to a connection-close stream you do not need to know the content-length ahead of time, which simplifies the house-keeping and buffering you might otherwise need to do.  HTTP 1.0 is really the domain of non-browser connections, so this is fairly marginal (usually; of course, in your job, you might meet such connections repeatedly; sad).

With chunked transfer you can write chunks as quickly as they become available - say one per call to some write() function - and you can even send trailing headers after the body.  You can’t rewrite the response code, however, so turning a 200 OK into some other error code after you’ve started writing to the socket is a no-no.
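Writing a chunk is only a few lines; here is a sketch (the function names are mine, not hellepoll’s) that frames each chunk as a hex length line, the payload and a CRLF, with a zero-length chunk terminating the body:

    #include <cstdio>
    #include <sys/uio.h>
    #include <unistd.h>

    // Write one HTTP/1.1 chunk straight to the socket; a real server would push
    // this through its user-space write buffer instead.
    void write_chunk(int fd, const char* data, size_t len) {
        char head[32];
        int head_len = snprintf(head, sizeof head, "%zx\r\n", len);
        iovec iov[3] = {
            { head, (size_t)head_len },
            { (void*)data, len },
            { (void*)"\r\n", 2 },
        };
        writev(fd, iov, 3);
    }

    // End the body with a zero-length chunk; any trailing headers would go
    // between the "0\r\n" and the final blank line.
    void end_chunks(int fd) {
        write(fd, "0\r\n\r\n", 5);
    }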

Pipelined requests have to be answered in the strict order they were received, which means that if you are handling them in parallel you may need to buffer some responses until it is their request’s turn.
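One way to do that book-keeping - a sketch of the idea rather than hellepoll’s actual structures - is to give each connection a FIFO of response slots, one per request in arrival order, and only let the front slot write through to the socket:

    #include <deque>
    #include <string>
    #include <unistd.h>

    // Hypothetical per-connection state: responses must leave in request order.
    struct Connection {
        struct Response {
            std::string buffered;    // bytes produced for this response so far
            bool complete = false;   // the handler has finished producing it
        };
        std::deque<Response> responses;   // one slot per pipelined request, FIFO

        // Handlers append to their own slot; only the front slot may hit the socket.
        void flush_in_order(int fd) {
            while (!responses.empty()) {
                Response& front = responses.front();
                while (!front.buffered.empty()) {
                    ssize_t sent = write(fd, front.buffered.data(), front.buffered.size());
                    if (sent <= 0) return;                 // socket full; try again later
                    front.buffered.erase(0, (size_t)sent);
                }
                if (!front.complete) return;               // later responses must keep waiting
                responses.pop_front();
            }
        }
    };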

Fast TCP connections are really down to buffer management.  Inside the kernel there is a read buffer and a write buffer associated with each socket; they are typically a few K each.  Reading from and writing to these buffers takes a system call, which is not free.  If you read these buffers one byte at a time you’ll end up with atrocious performance dominated by the cost of syscalls.  Once you set about maximising performance, you find yourself counting syscalls and trying to keep the count per connection down in the single digits.

So you really need to buffer in user-space for each socket, on both the read and the write side.  A single read buffer per connection works well, and for writes I’ve found it effective to put buffers that can’t be written through immediately into a linked list.  My servers, handling strange things like messaging and video broadcasting, have made efforts to share writable chunks between connections.  Eventually the socket will become writable again, and at that point you can use writev() to send off as many buffers as possible in a single syscall.
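A sketch of the writev() step, assuming the pending write buffers live in some simple queue (a std::deque here, rather than whatever hellepoll actually uses):

    #include <deque>
    #include <string>
    #include <sys/uio.h>
    #include <unistd.h>

    // Flush as many queued write buffers as possible, a batch per writev() call.
    // Returns true once everything has gone out, false if the socket filled up.
    bool flush_pending(int fd, std::deque<std::string>& pending) {
        while (!pending.empty()) {
            enum { kMaxIov = 64 };        // kernels cap the count at IOV_MAX anyway
            iovec iov[kMaxIov];
            int n = 0;
            for (const std::string& buf : pending) {
                if (n == kMaxIov) break;
                iov[n].iov_base = (void*)buf.data();
                iov[n].iov_len  = buf.size();
                ++n;
            }
            ssize_t sent = writev(fd, iov, n);
            if (sent <= 0) return false;  // EAGAIN or error: wait for writability
            // Drop the buffers that went out whole; trim any partially-sent one.
            while (sent > 0) {
                std::string& front = pending.front();
                if ((size_t)sent >= front.size()) {
                    sent -= (ssize_t)front.size();
                    pending.pop_front();
                } else {
                    front.erase(0, (size_t)sent);
                    sent = 0;
                }
            }
        }
        return true;
    }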

Browser support for gzipped data offers exciting opportunities.  If you have lots of reused fragments you can pre-deflate them - possibly using only Huffman coding, although patching up the offsets of back-reference matches is certainly viable - and write them out ready-compressed, tagging the dynamic data as uncompressed literals rather than expending CPU compressing it on the fly.  The more re-used your data and the slower your language choice, the bigger the payoff - I’d imagine it’s an ideal speedup for Python-based multi-user chat servers, for example.
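A heavily hedged sketch of the uncompressed-literal half of that trick: with raw deflate, dynamic data can be framed as "stored" blocks that cost no compression CPU at all, provided the stream is byte-aligned at that point (which it is if the pre-compressed static fragments were produced with zlib’s Z_SYNC_FLUSH; zlib’s Z_HUFFMAN_ONLY strategy also keeps back-references from reaching across fragments).  The gzip header and the Content-Encoding: gzip response header are not shown:

    #include <cstdint>
    #include <string>

    // Frame dynamic data as raw-deflate "stored" (uncompressed) blocks and append
    // them to the outgoing body.  Assumes the deflate stream is byte-aligned here;
    // a stored block holds at most 65535 bytes, so long data is split.
    void append_stored_blocks(std::string& out, const char* data, size_t len, bool final_block) {
        do {
            uint16_t n = len > 65535 ? 65535 : (uint16_t)len;
            bool last = final_block && n == len;
            out.push_back(last ? (char)0x01 : (char)0x00);  // BFINAL bit, BTYPE=00 (stored)
            uint16_t nlen = (uint16_t)~n;
            out.push_back((char)(n & 0xff));                // LEN, little-endian
            out.push_back((char)(n >> 8));
            out.push_back((char)(nlen & 0xff));             // NLEN, one's complement of LEN
            out.push_back((char)(nlen >> 8));
            out.append(data, n);
            data += n;
            len  -= n;
        } while (len > 0);
    }

The gzip wrapper still needs a correct CRC32 and length trailer over the uncompressed bytes, so there is some per-byte work left, but far less than deflating on the fly.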

Some clients use keep-alive connections but don’t pipeline.  In that case, when you write something, they’ll be waiting for Nagle’s algorithm to kick in and actually put it on the wire before they send their next request.  This means they can end up slower than if they had used connection-close!  In these circumstances it is important to flush the write buffer as soon as you have written the final part of a response, if the connection is keep-alive and you haven’t already received the next pipelined request.  There is no nice ‘flush’ function, but you can achieve the same effect by calling set_nodelay(true) and then set_nodelay(false) before any more writes.  That is two dreaded syscalls, so it’s best to check for a pipelined request before invoking them.
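The toggle is just setsockopt() on TCP_NODELAY; something like this (the naming is mine, not hellepoll’s API):

    #include <netinet/in.h>
    #include <netinet/tcp.h>
    #include <sys/socket.h>

    // Force whatever is sitting in the kernel's send buffer out on the wire now,
    // then restore Nagle so later small writes can still be coalesced.
    void flush_now(int fd) {
        int on = 1, off = 0;
        setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &on, sizeof on);    // kicks pending data out
        setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &off, sizeof off);  // Nagle back on
    }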

With level-triggered polling, a non-blocking socket reports that it is available for writing on every poll/select/epoll call you make, even if you have nothing to write.  Edge-triggering is how you stop wasting time on sockets in this state, and it can give a comfortable speed boost.
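With epoll that means registering the socket with EPOLLET and, on the write side, only going back to the poller once write() or writev() has reported EAGAIN; a sketch:

    #include <cerrno>
    #include <sys/epoll.h>
    #include <unistd.h>

    // Register a connection edge-triggered: epoll only reports the socket when its
    // readiness changes, not on every epoll_wait() while it stays writable.
    void watch_connection(int ep, int fd) {
        epoll_event ev = {};
        ev.events = EPOLLIN | EPOLLOUT | EPOLLET;
        ev.data.fd = fd;
        epoll_ctl(ep, EPOLL_CTL_ADD, fd, &ev);
    }

    // The edge-triggered rule of thumb: keep writing until the kernel says EAGAIN,
    // otherwise you may never see another EPOLLOUT for this socket.
    bool write_until_blocked(int fd, const char* data, size_t len) {
        while (len > 0) {
            ssize_t sent = write(fd, data, len);
            if (sent < 0) {
                if (errno == EAGAIN || errno == EWOULDBLOCK) return false; // wait for EPOLLOUT
                return false;                                              // real error: close the connection
            }
            data += sent;
            len  -= (size_t)sent;
        }
        return true;
    }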

I haven’t taken more than a cursory glance at SPDY, but that glance says it’s pipelining++ and that chunks belonging to different requests can be written in any order.  It sounds delicious.
