Tuesday, June 3, 2014

Compressing Scrabble Dictionaries

The newest rec.math Al Zimmermann Programming Contest is a solitaire Scrabble puzzle called ‘Alphabet City’.

Wait, what do you mean you’re not entering that programming contest??  It’s a really, really nice, clean problem and it’s as addictive as 2048!

To look up Scrabble words quickly, people use a GADDAG.  The GADDAG is a tree data structure.  Each node represents a character and has pointers to the child nodes that continue all possible words.

The key thing is that every rotation of each word is stored in the GADDAG.  To have a GADDAG of the words ALL, BALL and CALL, for each word we add all rotations.  This is LLA@, LA@L, A@LL for ALL, and LLAB@, LAB@L, AB@LL, B@ALL for BALL and so on.  (@ is my choice of ‘start-of-word’ marker; it’s a good choice because in ASCII it’s the character before A.  When you have arrays of child nodes [A..Z], this can easily be extended to have a slot pointing at the start-of-word node [@..Z].  And it looks nice enough in the graphs too.)
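Here’s a minimal sketch in C of how those rotations can be generated (the function name and buffer size are mine): reverse the first one, two, … letters of the word, append the @ marker, then copy the rest of the word unchanged.

```c
#include <stdio.h>
#include <string.h>

/* Print every GADDAG rotation of a word: the prefix (1..len letters) reversed,
   then the '@' start-of-word marker, then the unchanged remainder.
   For "ALL" this prints A@LL, LA@L, LLA@ - the same set as above. */
static void gaddag_rotations(const char *word)
{
    size_t len = strlen(word);
    char rot[64];                                   /* plenty for dictionary words */

    for (size_t split = 1; split <= len; split++) {
        size_t n = 0;
        for (size_t i = split; i-- > 0; )           /* reversed prefix */
            rot[n++] = word[i];
        rot[n++] = '@';                             /* start-of-word marker */
        memcpy(rot + n, word + split, len - split); /* unchanged suffix */
        rot[n + len - split] = '\0';
        printf("%s\n", rot);
    }
}

int main(void)
{
    gaddag_rotations("ALL");
    gaddag_rotations("BALL");
    gaddag_rotations("CALL");
    return 0;
}
```

Each of these strings is then inserted into the tree character by character, much as you would insert words into a trie.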

To speed up Scrabble searches, the prefix of each word is stored reversed, which facilitates ‘hook’ searches.  To search for all words with one unknown letter, then A, then another unknown, then L (?A?L), you start with the first known letter - A - and look for all those with one character in front (a child with an @ child), terminated by an L.

A naive sparse implementation of a GADDAG takes an absolute load of memory: over a GB is not startling.  Almost all those nodes will be duplicates of other parts of the tree and almost all possible children will exist.  So we really want to get the memory usage down if we want to avoid our program waiting on main memory all the time: the bottom-level cache is typically tens of nanoseconds away, whereas main memory is around a hundred nanoseconds away.  Getting the GADDAG small enough to fit into cache, even the bottom-level cache (as in slowest and biggest), can mean an order of magnitude speedup!  So it’s well worth compressing your GADDAG.

The original paper describes compression and gives an example of partial compression.  I will talk these through with graphs:

[image: the uncompressed GADDAG for ALL, BALL and CALL]

Here’s the uncompressed tree.  Sorry it takes so much screen space .. it has a lot of nodes in it, which is a point I want to make :)

3 words, 11 rotations and 27 nodes.

Let’s start compressing it:

The first thing to note is that, unlike the ALL, BALL, CALL example on the NullWords blog, I am storing the terminating characters (the blue nodes) in a separate set off the node.  This is how the prefix LLA is followed by a single node with both B and C in it.

That aside, you can see a lot of obvious repetition in the tree.  The original paper had a recipe for a partial compression as you build the GADDAG: for each word, you merge the suffixes of its rotations.  It looks like this:

[image: the partially compressed GADDAG]

We still have 25 nodes!  We’ve only saved 2 nodes.  This is not a fair example, however: in the full contest wordlist (172,820 words, 1,570,604 rotations and 3,213,156 nodes) this partial optimisation gets it down to just 1,534,313 nodes.

So how small do we want to go?  My laptop has a pretty decent 6MB of L3 cache.  If we can get the whole GADDAG in under 6MB, we’ll avoid ever hitting main memory.  (Cache line collisions ignored.)

So we can still see we have lots and lots of duplication.  The @ node is often followed by L, for example.  In order to merge these tails, we have to first build the uncompressed graph (partial compression during load helps speed things up though) and then compare nodes.  Nodes are the same if their children have the same values and their terminating character set is the same.  To find these, I simply sorted all nodes using a deep comparison.  Here’s all the identical tails merged:

[image: the GADDAG with all identical tails merged]

Now we’re down to just 17 nodes!  You can still walk the tree in the same way: it’s the exact same tree, only with the duplication removed.
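Here’s a rough sketch in C of what that deep comparison might look like; the node layout and field names are my own assumptions rather than the code behind this post.  Sorting an array of node pointers with this comparator puts identical subtrees next to each other, so one pass over the sorted array can keep a single representative of each group and repoint the parents at it.

```c
#include <stdint.h>
#include <stdlib.h>

/* Assumed in-memory layout of an uncompressed node. */
typedef struct Node {
    char letter;                /* the character this node was built for          */
    uint32_t terminators;       /* bit per letter that ends a word at this node   */
    struct Node *child[27];     /* slot 0 = '@', 1..26 = 'A'..'Z'; NULL if absent */
} Node;

/* "Deep comparison": two nodes are the same if their letter and terminator set
   match and every child slot leads to a (recursively) identical subtree. */
static int deep_cmp(const Node *a, const Node *b)
{
    if (a == b) return 0;
    if (a->letter != b->letter)
        return a->letter < b->letter ? -1 : 1;
    if (a->terminators != b->terminators)
        return a->terminators < b->terminators ? -1 : 1;
    for (int i = 0; i < 27; i++) {
        if (!a->child[i] || !b->child[i]) {
            if (a->child[i] != b->child[i])
                return a->child[i] ? 1 : -1;
            continue;
        }
        int c = deep_cmp(a->child[i], b->child[i]);
        if (c) return c;
    }
    return 0;
}

/* qsort comparator over an array of Node*, e.g.
   qsort(all_nodes, count, sizeof(Node *), cmp_node_ptrs); */
static int cmp_node_ptrs(const void *pa, const void *pb)
{
    return deep_cmp(*(const Node *const *)pa, *(const Node *const *)pb);
}
```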

For the contest wordlist we now have only 533,173 nodes!  So 3x fewer nodes than the partial compression.  And a loose packing of these nodes and their links (using an int32 array, rather than byte-packing) weighs in at 6.3MB.  Just a bit too big for my L3 cache, so we need to compress more.

At this point it’s worth describing how to actually pack the tree into an array.  We want an array of int32 because it’s nice and fast to access.  You can pack tighter if you go to bytes (e.g. the address of other nodes can use just 3 bytes instead of 4), but this means some more twiddling while walking the tree: it’s worth it if you have only 2MB L2/L3 though!

So each node starts as an int32.  The two high bits are boolean flags saying whether the node has terminators and whether it has children.  If the node has terminators then the terminators are stored in the low 26 bits of the first int32, and any children have their bits set in the next int32.  The children int32 has a bit set for each child present.  And then the addresses of the children follow.

In the optimised tree there are 533,173 unique nodes with 979,160 links.  48,246 nodes have terminators and multiple children, and 137,905 have multiple children but no terminators.  29,506 nodes have only one child link plus terminators, and 316,364 nodes have a single child link and no terminators.  1,152 nodes have only terminators.

As so many nodes have only one child link, you can pack this even tighter if you have a third high bit flag on the first word to mark that the node has only a single child, and then pack the ordinal of the child and its offset into the remainder of the header int32.

Here’s a diagram of my format:

[image: diagram of the packed node format]

Three types of node are shown.  The first three bits in the first int describe which fields are present.  If there are terminators, the bit mask of terminating characters is in the first int.  The bit mask for multiple children is in the first available int, which will also be the first int if there are no terminators.  If a node has only a single child, it can be packed into a single int.  You can of course have a node with a single child and some terminators, which takes two ints (not shown).
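To make that concrete, here’s a sketch in C of how those node types could be decoded.  The flag bit positions and the way the single-child letter and address share a word are my own illustrative choices, not necessarily the exact bits in the diagram above.

```c
#include <stdint.h>

/* Assumed flag bits in a node's header int32. */
#define HAS_TERMINATORS (1u << 31)  /* low 26 bits of the header are the terminator set   */
#define HAS_CHILDREN    (1u << 30)  /* a child bitmask int (then child addresses) follows */
#define SINGLE_CHILD    (1u << 29)  /* exactly one child: letter + address packed together */

/* Layouts sketched here (field positions are assumptions):
 *   terminators + several children: [flags | terminator mask] [child mask] [child addr ...]
 *   several children, no terms:     [flags | child mask]      [child addr ...]
 *   single child, no terms:         [flags | letter << 24 | child addr in low 24 bits]
 *   single child + terminators:     [flags | terminator mask] [letter << 24 | child addr]
 */

/* Does this packed node have a child for 'letter' (0 = '@', 1 = 'A', ..., 26 = 'Z')? */
static int has_child(const uint32_t *node, int letter)
{
    uint32_t header = node[0];
    /* the child information lives in the header itself, unless the low bits
       of the header are already occupied by the terminator mask */
    uint32_t info = (header & HAS_TERMINATORS) ? node[1] : header;

    if (header & SINGLE_CHILD)
        return (int)((info >> 24) & 0x1F) == letter;
    if (!(header & HAS_CHILDREN))
        return 0;
    return (int)((info >> letter) & 1);
}
```

Finding the address of a child in the multi-child case takes one more step, which is exactly the popcount trick described a little further down.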

Determining whether a node has a particular child is easy with a bit mask.  To work out the address of a child, though, you have to iterate over the bit mask of multiple children.  Modern fast CPUs have find-first-set ops - available via C intrinsics like __builtin_ffs() - but you’re still walking the bits.  One idea is to use a differently ordered alphabet: if instead of A=1, B=2, C=3 and so on you reorder the letters so that the most common children come sooner, you might get a small speed boost.

Often GADDAGs seem to be packed more tightly still (e.g. Quackle) by having a marker bit in each int32 to say whether it is the first child pointer in a node or a continuation.

However, I prefer bit-fields because they speed up searching significantly.  As you walk over free tiles you can turn your rack into a single int32 mask, which you can then AND with the bit-field in the header to see which letters in the rack are continuations of the word.

And if you have access to the C intrinsics you can mask off the bits above the child you want the index of and popcount() what’s left.  This is branchless, so it saves you doing lots of walking over child nodes doing if-statements on each one.
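Here’s a short sketch of both tricks in C, assuming the child addresses are stored in letter order immediately after the child bitmask (the same illustrative layout as the earlier sketch):

```c
#include <stdint.h>

/* Child lookup without a loop: the number of set bits below 'letter' in the
   child mask is exactly how many child addresses are stored before the one
   we want, so popcount gives us its index directly. */
static uint32_t child_address(uint32_t child_mask, const uint32_t *child_addrs, int letter)
{
    uint32_t bit = 1u << letter;        /* 0 = '@', 1 = 'A', ..., 26 = 'Z' */
    if (!(child_mask & bit))
        return 0;                       /* no such child (0 means "none" here) */
    int before = __builtin_popcount(child_mask & (bit - 1));
    return child_addrs[before];
}

/* Rack check: one AND tells you which rack letters can continue the word at
   this node, instead of testing the 27 child slots one by one. */
static uint32_t playable_letters(uint32_t rack_mask, uint32_t child_mask)
{
    return rack_mask & child_mask;
}
```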

Still, if you have only 2MB L2, tighter packing may pay off big-time.  As fellow Mill CPU team member Terje Mathisen says, "almost all programming can be viewed as an exercise in caching".

The thing is, we’re still missing a massive trick:

When you walk the tree, you know before visiting a child what its character is.  You’ll see above that I don’t actually store the character in the node.

So nodes that were originally built for completely different characters and yet have the same exit branches (and those exit branches have the same exit branches and so on) can be merged!  Feast your eyes on this beauty:

[image: the fully merged GADDAG]

Now the ALL, BALL, CALL tree is down to just 12 nodes!  The red arrows show an exit vector that is different from the node it is arriving at; fundamentally, the next character is a function of the edge you travel, not the node you arrive at.
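In terms of the earlier deep-comparison sketch, this final merge is a tiny change: drop the node’s own character from the equality test, because the character is now carried by the edge you travelled rather than the node you arrive at.  Something like this (again with my assumed field names):

```c
#include <stdint.h>

/* Same assumed node layout as the earlier sketch; the letter field is still
   there but no longer takes part in the node's identity. */
typedef struct Node {
    char letter;
    uint32_t terminators;
    struct Node *child[27];
} Node;

/* Deep comparison without the letter test: nodes built for different characters
   merge whenever their terminator sets and (recursively) their exits match. */
static int deep_cmp_edges(const Node *a, const Node *b)
{
    if (a == b) return 0;
    if (a->terminators != b->terminators)
        return a->terminators < b->terminators ? -1 : 1;
    for (int i = 0; i < 27; i++) {
        if (!a->child[i] || !b->child[i]) {
            if (a->child[i] != b->child[i])
                return a->child[i] ? 1 : -1;
            continue;
        }
        int c = deep_cmp_edges(a->child[i], b->child[i]);
        if (c) return c;
    }
    return 0;
}
```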

It’s hard to believe that this is the exact same GADDAG as the original diagram at the top of this post!  But it’s true.

For the contest word list, it’s now down to just 419,264 nodes and 811,985 links.

We can also - obviously - omit words in the dictionary that are longer than 15 characters, as the Scrabble board is only 15 squares across.  In the final GADDAG there are 168,548 words, 1,497,605 rotations and 2,994,373 logical nodes.  But with compression, there are only 391,101 unique nodes and 774,216 links.  33,255 nodes have terminators and multiple children, and 122,012 have multiple children but no terminators.  18,780 nodes have only one child link plus terminators, and 216,702 have just the one link and no terminators.  This compresses to … under 4MB :D

Good luck in the contest to you too ;)
