Monday, September 17, 2012

Explicitly async APIs vs Coroutines

TL;DR: coroutines are as complicated as normal multi-threaded programming, yet locking is often overlooked.  Coroutines hide race-condition bugs.  Explicitly asynchronous APIs cannot hide race-condition bugs from you in the same way.

A post, Coroutines vs explicitly async APIs, has been stirring controversy lately, saying that the coroutine IO style is superior to explicitly asynchronous IO.  Its main points:

The disadvantage here is the way you write the [async] code:

  1. More cognitive effort is required to express the control flow in a soup of callback handlers.
  2. The resulting code is harder to read.

    From 1 and 2, we also get:

  3. The resulting code is harder to maintain.

    Because maintenance requires both reading and writing.

I believe these conclusions to be wrong.  The race-conditions I explain below invalidate these posited advantages: it takes cognitive effort to x-ray coroutine code and remember what blocks where; the code is harder to read because you have to follow through every function call; and it is harder to maintain because you must re-model the yielding structure of the app, and the locking it needs, each time you revisit it.

But first, we must clarify what we mean by explicitly asynchronous IO and coroutine IO APIs:

An explicitly asynchronous API is where you have an event loop and IO functions take callbacks that get called - from the event loop - when the IO completes.
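As a rough illustration of that definition, here is a minimal sketch of a toy event loop in Python.  EventLoop and read_async are hypothetical names made up for this example, not any real library's API; the point is only that the IO function returns immediately and the callback is later invoked from the loop.

```python
from collections import deque

class EventLoop:
    """Toy event loop: runs queued callbacks until none remain."""
    def __init__(self):
        self._ready = deque()

    def call_soon(self, callback, *args):
        self._ready.append((callback, args))

    def run(self):
        while self._ready:
            callback, args = self._ready.popleft()
            callback(*args)

def read_async(loop, data, on_done):
    """Hypothetical async read: returns at once, delivers data later via the loop."""
    loop.call_soon(on_done, data)

events = []
loop = EventLoop()
read_async(loop, "hello", lambda data: events.append(("read done", data)))
events.append(("read started", None))  # runs before the callback fires
loop.run()
```

The caller carries on straight after read_async returns, and the completion only runs once control is back in the loop.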

The event loop is the mainstream program building block.  It’s the basis of the Windows API and the apps running on Windows, it’s how the apps on your phone work and it’s how games tick.

But explicitly asynchronous APIs are a bit rarer.  Most apps using event loops still do blocking IO.  If that blocking IO might block for a long time (e.g. it’s waiting for something from a socket) it might be done in a thread dedicated to that IO, using a queue to get actionable messages back into the event loop where the main logic resides.
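The dedicated-thread-plus-queue arrangement can be sketched roughly like this in Python.  blocking_read is a hypothetical stand-in for a real blocking socket read, and the message shapes are assumptions for the example.

```python
import queue
import threading

def blocking_read(source):
    """Stand-in for a blocking socket read (hypothetical)."""
    return source.upper()

def io_worker(source, inbox):
    # Runs in its own thread; it only ever touches the thread-safe queue.
    inbox.put(("data", blocking_read(source)))
    inbox.put(("eof", None))

def run_event_loop(inbox):
    """Main logic: handle messages one at a time on a single thread."""
    received = []
    while True:
        kind, payload = inbox.get()
        if kind == "eof":
            return received
        received.append(payload)

inbox = queue.Queue()
worker = threading.Thread(target=io_worker, args=("ping", inbox))
worker.start()
received = run_event_loop(inbox)
worker.join()
```

All application state stays on the event-loop thread; the IO thread communicates only through the queue.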

Low-level asynchronous C APIs for sockets, and more recently files, arrived in Linux and Windows and offered clear performance advantages in many situations.  Increasingly, web servers moved over to asynchronous IO but often hid this from the application developer.

Recently there has been a rash of explicitly asynchronous APIs at a higher level; node.js and tornado, for example.  node.js is particularly significant because it’s gaining traction and attention, and because it has no concept of threads that would allow any other way of doing IO.

Coroutine IO is the pattern where the IO function suspends and a green-thread scheduler continues other tasks when IO that would block is encountered.  The definition of coroutines is a bit more general than that and includes iterators and such too; but my definition is the one used in the article I’m refuting, where it is compared to explicitly asynchronous APIs, so this is an appropriate narrowing.
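To make the scheduler idea concrete, here is a minimal sketch using Python generators as the green threads.  task and run are hypothetical names for this example.  Note one important difference from the libraries discussed here: in this sketch the yield is written out in the task, whereas in a gevent-style library the switch happens inside an ordinary-looking IO call.

```python
from collections import deque

def task(name, steps, log):
    for i in range(steps):
        log.append(f"{name}:{i}")
        yield  # stands in for IO that would block: hand control to the scheduler

def run(tasks):
    """Round-robin scheduler: resume each task until it yields or finishes."""
    ready = deque(tasks)
    while ready:
        current = ready.popleft()
        try:
            next(current)
        except StopIteration:
            continue  # task finished; drop it
        ready.append(current)

log = []
run([task("a", 2, log), task("b", 2, log)])
```

The two tasks interleave at every yield, which is exactly the serialised-but-concurrent execution the definition describes.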

Explicit coroutine IO libraries are rare beasts, just like explicitly asynchronous APIs.  They are typically end-user libraries that you have to be disciplined about and use exclusively.  They typically operate by rewiring some pretty low-level stuff and don’t mix well with other event mechanisms.  Examples include libtask and gevent.

The bug:

Coroutine IO is, as I said, cooperative green-threading.  It is multi-threaded programming, just with serialised execution.  And it can suffer from race-conditions too, that scourge of all multi-threaded code.  This is my major gripe with coroutines; you have to be able to x-ray code to know if a function you call might itself yield to the scheduler.  Consider the following innocuous snippet:

session = sessions.get_session(cookie);
if(!session)
   session = sessions.create_session(user_id);

Looks atomic, doesn’t it?  The sessions variable is presumably a singleton and is shared state between coroutines.

Imagine you inspect this code.  Imagine you wrote it.  You know what get_session and create_session do.  You know they don’t yield to the scheduler.  You move on.

And then during some maintenance or future work someone drops a syslog logger into the app.  Suddenly the log-lines littering create_session can yield.  Bang!
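Here is a minimal sketch of how that plays out, using Python's asyncio as a stand-in scheduler.  The Sessions class, keying sessions by cookie, and the sleep(0) standing in for the new logger are all assumptions for illustration.  asyncio marks the suspension point with await, so the yield is at least visible here; in a gevent-style library the same switch would hide inside a plain function call.

```python
import asyncio

class Sessions:
    def __init__(self):
        self.by_cookie = {}
        self.created = 0

    def get_session(self, cookie):
        return self.by_cookie.get(cookie)

    async def create_session(self, cookie):
        await asyncio.sleep(0)  # the newly added logger yields to the scheduler here
        self.created += 1
        self.by_cookie[cookie] = {"id": self.created}
        return self.by_cookie[cookie]

async def handle(sessions, cookie):
    session = sessions.get_session(cookie)
    if not session:
        # Another coroutine can run between the check above and the insert.
        session = await sessions.create_session(cookie)
    return session

async def main():
    sessions = Sessions()
    await asyncio.gather(handle(sessions, "abc"), handle(sessions, "abc"))
    return sessions.created

created = asyncio.run(main())  # two sessions get created for one cookie
```

Both requests see no session, both yield inside create_session, and both then create one.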

Coroutines were posited as preferable to explicitly asynchronous APIs because they were simpler and more robust.  Well, that’s just because it’s so shockingly easy to write incorrect non-defensive code that actually mostly works most of the time.

We all know that multi-threaded programming is hard to get right.  Well, as I pointed out, coroutines are multi-threaded programming.  You need to lock.  Yet some coroutine libraries omit locking primitives and others hide them at the bottom of the docs saying “you probably don’t need them”.
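As a sketch of what that locking looks like, here is a check-then-create sequence guarded by asyncio.Lock in Python.  The Sessions class and the sleep(0) standing in for a yielding call are assumptions for illustration.

```python
import asyncio

class Sessions:
    def __init__(self):
        self.by_cookie = {}
        self.created = 0
        self.lock = asyncio.Lock()

async def handle(sessions, cookie):
    async with sessions.lock:  # check-then-create is now one critical section
        session = sessions.by_cookie.get(cookie)
        if not session:
            await asyncio.sleep(0)  # creation may yield (e.g. logging)
            sessions.created += 1
            session = sessions.by_cookie[cookie] = {"id": sessions.created}
    return session

async def main():
    sessions = Sessions()  # built inside the running loop
    first, second = await asyncio.gather(
        handle(sessions, "abc"), handle(sessions, "abc"))
    return first, second, sessions.created

first, second, created = asyncio.run(main())
```

The second request waits on the lock and then finds the session the first one created.  The cost is that you have to know the lock is needed in the first place.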

The solution:

There are really two camps:

Go and its goroutines try to focus on share-nothing concurrency:

Do not communicate by sharing memory; instead, share memory by communicating

I’m a big fan of this approach.  It’s not the kind of coroutine IO I am railing against here.  It’s not the average monkey-patching add-on library approach; it’s just a shame that the realities are such that they have to allow you to share memory if you really want to.
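The share-nothing idea can be imitated outside Go.  Below is a rough Python sketch in which an asyncio.Queue plays the part of a channel and a single owner task is the only code that touches the session table.  session_owner, ask and the message format are hypothetical names and shapes chosen for this example.

```python
import asyncio

async def session_owner(requests):
    """The only task that ever touches the sessions dict."""
    sessions = {}
    while True:
        cookie, reply = await requests.get()
        if cookie is None:  # shutdown message: report how many sessions exist
            reply.set_result(len(sessions))
            return
        if cookie not in sessions:
            await asyncio.sleep(0)  # may yield; still safe, nobody else has the dict
            sessions[cookie] = {"id": len(sessions) + 1}
        reply.set_result(sessions[cookie])

async def ask(requests, cookie):
    reply = asyncio.get_running_loop().create_future()
    await requests.put((cookie, reply))
    return await reply

async def main():
    requests = asyncio.Queue()
    owner = asyncio.create_task(session_owner(requests))
    first, second = await asyncio.gather(ask(requests, "abc"), ask(requests, "abc"))
    total = await ask(requests, None)
    await owner
    return first, second, total

first, second, total = asyncio.run(main())
```

Requests are handled one at a time by the owner, so no lock is needed even though the owner itself may yield.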

When the language doesn’t support channels natively, this brings us to the second solution: explicitly asynchronous IO APIs.  Especially if your language has closures (C++11 thank you!) then explicitly asynchronous IO is natural, robust and surprisingly straightforward.
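For comparison, here is a minimal sketch of the session lookup in explicitly asynchronous style, written in Python with closures.  The pending queue is a toy event loop, and create_session_async and handle are hypothetical names.  The callback parameter advertises that creation completes later, so the need to re-check state on completion is visible at the call site rather than buried in a callee.

```python
from collections import deque

pending = deque()  # toy event loop: a queue of callbacks
sessions = {}

def create_session_async(cookie, on_created):
    """Hypothetical async create: the callback says it completes later."""
    def finish():
        # Re-check on completion: other callbacks may have run in the meantime.
        session = sessions.setdefault(cookie, {"id": len(sessions) + 1})
        on_created(session)
    pending.append(finish)  # closure keeps cookie and on_created alive

def handle(cookie, on_session):
    session = sessions.get(cookie)
    if session:
        on_session(session)
    else:
        create_session_async(cookie, on_session)

results = []
handle("abc", results.append)
handle("abc", results.append)
while pending:  # drain the loop
    pending.popleft()()
```

Each callback runs to completion without interruption, so the only places state can change under you are the ones where you handed over a callback.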

So go single threaded, go explicitly asynchronous APIs.  Be sane!

Notes

  1. williamedwardscoder posted this
