NOTE: I have since reassessed my assessment. There are other approaches I favour instead, just as long as this isn’t all brushed under the carpet. On CPython, for example, it is my understanding that the mitigation they have chosen is not enabled by default for 2.x users :( Anyway, there are worse ways to bring down my servers than sending them very large JSON payloads :)
Here’s the original article:
Very recently at the Chaos Computer Club Congress there was a really fascinating brilliant little attack on just about every webserver out there: hash collisions causing O(n^2) performance of the hash-table used to store POST requests (or HTTP headers, JSON parsers etc).
One solution might be to use a cryptographically secure hash function - there is nothing but brute force for the attacker to generate a collision, so with a nice long key and a big default table size you’ll be OK.
Another is to use a random hash function. This seems like a good way to avoid small-scale performance degradations (says someone knowledgeable I know).
Both these approaches think about the hash function, and trade absolute best-case performance for some security against a worst-case attack.
But to my mind, this is all about collision resolution.
Not all hash tables use buckets, but most do.
The problem is that the collisions get put in a linked list. A linked list is an OK bucket structure for the best/average case, and the attacker is making sure you have the worst case (and bad cache trashing to boot).
The linked list bucket is all about speed. It can be assumed that some thought has gone into picking simple and fast hashtable implementations for the normal input. Slowing down the hashtable implementation in the every-day use-case in order to avoid some attack is is going to be a hard sell. That’s why I gulp when DJB recommends just using crit-bit trees.
If, instead, the linked list got promoted to a balanced tree (using key value rather than hash) the moment it reached some small threshold in size, the best case performance would be the same, the worst case would be O(lg n).
This is a low-impact change. The people who implement hash-tables are the same people who implement trees, and they are very familiar with them; you wouldn’t anticipate any big bugs in it, or difficult maintenance or such down-sides.
And their unit-tests for it would be pathological input, so you’d get some idea of the real-world worst-case in real benchmarkable numbers from the whole exercise too.
Alternatively, if the oversized buckets were turned into child hash-tables using dynamic perfect hashing or something fancy instead, you’d have O(1) worst case.
I rather like the tree for buckets idea.
↓ click the "share" button below!