Thursday, December 6, 2012

Making the History of the World's Religions map

TL;DR compressing better than GZip, in Javascript, at the bottom ;)

An interactive, clickable map of the world's religions is available on-line at http://williame.github.com/map_of_worlds_religions/

Here's a time-lapse of it running, with the narrative (which you can mute in the interactive version):

The real making of the map was excellent research by the usual suspect.  After his Enigma spreadsheet, and other side projects, he recently took it into his head to make a world-religion time-line spreadsheet.  And my blog post about how we moved from a spreadsheet to a web-page can't really hide that the hard work wasn't technical; it was research.  But here's a technical article:

My friend’s an Excel wizard, but he chose to use Sun/Open/LibreOffice Calc.  Oh boy what a mistake!

Of course he roped me in to critique and fix.

LibreOffice Calc sucks!

His first version had a macro that went through the cells on the map, looked up each cell's religion for the chosen year in the main data set, and then looked that religion up to determine the colour:

for x in range(width):
    for y in range(height):
        # the data sheet holds one column per year; offset by the chosen year
        religion = cell(x + year, y, data_sheet).text
        colour = default_colour
        # linear scan of the lookup sheet for this religion's colour
        for i in range(num_religions):
            if cell(0, i, lookup_sheet).text == religion:
                colour = cell(0, i, lookup_sheet).background_colour
                break
        cell(x, y, map_sheet).background_colour = colour

The map is 70x80 (so 5,600 cells), and the most popular religions are at the top of the lookup list.

And yet that took 20 to 30+ seconds to run on modern laptops.

This is just wrong on so many levels.  Even the most inefficient code should do such a calculation in less than a second.  Excel 5 under Windows 3.1 on a 486 would have run rings around LibreOffice on this!

Being an Excel wizard, my friend was flabbergasted by LibreOffice's shockingly poor performance here.  This kind of sloppy, inefficient code is no problem at all in VBA.

He halved the time by skipping the 50% of cells that were Sea.  And it was still taking forever to change years.

So I dived in to speed it up… and surfaced red faced and making excuses :(

LibreOffice’s Basic editor is a hollow unloved backwater; absolutely nothing like the excellent VBA editor in Microsoft Office.

It doesn't have any associative arrays (the undocumented ones I found mentioned in various confused forums didn't work for us), so I couldn't move the religion-to-colour mapping into code.  I simply couldn't find a quick way to write the for-loops better.

The next step was to try moving the styling into formulas instead, using the T() and STYLE() functions, VLOOKUP()s, etc.  This was much, much faster.  But it still took seconds.

Everything in LibreOffice takes seconds.  Opening documents takes seconds, recalculating takes seconds, even the message saying it's recalculating column widths takes seconds.

I clicked around and wrote macros to do things like convert a colour to hex (the Basic even lacks a hex() function) and generally became familiar with LibreOffice Calc.  It feels like a dated and unresponsive Excel.

The more I used LibreOffice, the stronger my dislike grew for the whole copy-by-committee, no-obvious-pride-nor-perfection feel of it all.  Pity those sacrificing their souls for something so unloved.

Who settled Madagascar?

But by now I was being charmed by the data-set.  Did you know that it was Indonesians going west who settled Madagascar, rather than Africans going east?  I had no idea.  That Christianity was once the official religion of Mongolia?  What's the religion of Switzerland?  I had no idea about most of it, in fact.

Rendering on the Web

My mind raced; even a noddy Python spreadsheet would compute it all near-instantly.  Yes, I know all about how spreadsheets calculate.  No, LibreOffice is not solving some massively big problem.  No, there are no excuses for being so painfully slow compared to all the other spreadsheets.  You even find people recommending editing LibreOffice spreadsheets in other office suites!

Even Javascript would do it.  Aha!  What we needed in this day and age was a web page!

It was easy to copy-n-paste the LibreOffice worksheets into a text editor, where they appeared as tab-separated-value files.  This is a super-easy text format to parse in Javascript.
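In fact the parsing is a couple of lines (a minimal sketch, assuming the raw worksheet text has already been fetched into a string called text):

// split the pasted worksheet into rows and tab-separated cells
var rows = text.split("\n").map(function (line) {
    return line.split("\t");    // one array of cell values per row
});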

For drawing the map, I opted for HTML5 canvas 2D fillRect() and so on, rather than my usual route into webGL.  My thinking was, I didn't want to over-optimise the first cut!

And, of course, a normal HTML5 canvas can draw the whole map in the blink of an eye.
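The drawing loop is nothing more than a filled rectangle per cell.  Something like this sketch (cellColour(x, y, year) is a hypothetical helper looking up a cell's religion colour; the real code differs in the details):

var ctx = canvas.getContext("2d");
var cw = canvas.width / 70, ch = canvas.height / 80;   // the 70x80 grid
for (var y = 0; y < 80; y++) {
    for (var x = 0; x < 70; x++) {
        ctx.fillStyle = cellColour(x, y, year);        // e.g. "#rrggbb"
        ctx.fillRect(x * cw, y * ch, cw, ch);
    }
}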

Narrative

My friend prefers to remain anonymous, so we opted for text-to-speech, as used for the intro to our Cage Flight game.

A quick toying with festival and espeak was really disappointing; they all sound like constipated Daleks.  But it happens that Google has a text-to-speech (TTS) synthesizer in Google Translate, in a non-public API.  If you feed it short snippets of text, you get back corresponding mp3 files, which you can concatenate to make your total audio clip.  I borrowed a snippet from the web and made a conversion script.
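The gist is just this (a rough sketch, not the actual script; the endpoint was unofficial and undocumented, so the URL and its parameters here are an assumption based on the snippets floating around the web):

// split the narrative into sentence-sized snippets the service accepts
var snippets = narrative.match(/[^.!?]+[.!?]/g);
var urls = snippets.map(function (s) {
    // assumed endpoint; it was never a public, documented API
    return "http://translate.google.com/translate_tts?tl=en&q=" +
           encodeURIComponent(s.trim());
});
// each URL yields a short mp3; naive byte-wise concatenation of mp3
// files plays back fine as one continuous clip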

Compressing the Map in Javascript

The problem, though, was the data size.  The main map data was nearly 5MB, and github-pages doesn't do gzipping.  (As it's static content, github could pre-compress everything, store it compressed, and decompress it to serve to clients that don't support compression…)

The data is 5,600 lines, one line for each of the 80x70 cells on the map.  And each line is a list of values, one per year on the time-line, saying what religion is in that cell in that year.  There are 91 years picked out on the time-line.

I Googled around and tried out some LZ77 and other Javascript decompressors.  And they were all terribly slow.

I considered the PNG approach (packing the data into an image's pixels, so the browser's built-in deflate does the decompression for free) but, being interested in data compression, I decided that a super-simple home-made approach would be effective.

So I set myself the task of making a map data decompressor in Javascript in one hour.

I had some constraints.  I wanted the output to be printable text, for one thing.  And I wanted to reserve tab and new-line for splitting things up, not use them in the actual map data.

So I assigned each religion a single printable character.  I then made a string from them, for each year, for each cell.
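The first pass is a plain table lookup; a sketch (code, religionAt() and the counts here are illustrative names and assumptions, not the map's actual code):

var out = "";
for (var cell = 0; cell < numCells; cell++) {       // 5,600 cells
    for (var year = 0; year < numYears; year++) {   // 91 years
        out += code[religionAt(cell, year)];        // one char per value
    }
}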

This was under half a megabyte, simply because one-character codes are massively shorter than the very redundant spelling out of each religion in the original tab-separated-value file.   80 x 70 x 91 = 509,600 bytes.

But I then run-length compressed this.  If three or more cells in a row have the same religion, I emit a special flag symbol and then printably encode the number of repetitions in the run, rather than emitting all the individual characters.
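A minimal sketch of such an encoder (the flag character and the exact printable encoding of the count are illustrative choices):

function rleEncode(s, FLAG) {
    var MAX_RUN = 78;   // keep 48 + run within printable ASCII
    var out = "";
    for (var i = 0; i < s.length; ) {
        var j = i;
        while (j < s.length && s[j] === s[i] && j - i < MAX_RUN) j++;
        var run = j - i;
        out += (run >= 3)
            ? FLAG + s[i] + String.fromCharCode(48 + run)  // flag, char, count
            : s.slice(i, j);                               // too short; emit as-is
        i = j;
    }
    return out;
}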

It's a very obvious approach.  And the resulting file was around 160K.

But this was still missing a lot of redundancy.  Most cells do not change from one year to the next, and their neighbouring cells do not change either.  So I replaced these non-changes with another special religion code which means “same as last year”.  Technically, a form of delta encoding.  And now, when run through the run-length encoder, the output is under 10K!
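And the decoder reverses both passes in a handful of lines.  A sketch, under the same illustrative symbol choices, assuming the string is laid out cell by cell with each cell's years consecutive (so “same as last year” is simply the previous character, and the encoder never emits it for a cell's first year):

function decodeMap(s, FLAG, SAME) {
    // pass 1: expand the runs
    var expanded = "";
    for (var i = 0; i < s.length; i++) {
        if (s[i] === FLAG) {
            var ch = s[i + 1], run = s.charCodeAt(i + 2) - 48;
            while (run--) expanded += ch;
            i += 2;
        } else {
            expanded += s[i];
        }
    }
    // pass 2: resolve "same as last year" against the previous year
    var out = "";
    for (var p = 0; p < expanded.length; p++) {
        out += (expanded[p] === SAME) ? out[p - 1] : expanded[p];
    }
    return out;
}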

My hour wasn't up, so I added a few features such as support for longer runs by emitting two-character run-length codes.  I didn't make further obvious optimisations, such as simply omitting Sea, or doing optimal parsing (selecting between ‘same as last year’ and the religion code whenever it best helps compression, etc.).  I even left a few bytes on the table with my rather narrow run-length codes themselves.  Because I restrict myself to 7-bit printable characters (and don't even use the full range of them), I'm leaving more than 1K of unused high bits in there too.  In a one-hour project, you have to accept some imperfections if it's functional.

But it was really worth it.  The browser can download a 12KB file of data (with religion names, colours and year index too) and then unpack it in the blink of an eye.  The whole “waiting for data” message is dismissed in a flash.

How is that for ratio?  Well, let's do some rough comparisons:

The original data file (without religion names) is 4,803,202 bytes.

Gzipping gives 69,337 bytes.  1.443557859%.  Not bad!  Clearly this is an extremely redundant file.  If github-pages supported gzip, we'd never have had this problem to solve.

7zipping gives 31,648 bytes.  0.658893796%.  Better.

ZPAQing gives 13,160 bytes.  0.273983896%.  Good!

RLE + delta, as we implemented in Javascript, gives 9,958 bytes for the map data.  0.207320033%.  As there are 5,600 cells in the map, that's encoding the full 91-year time-line of each cell in less than 2 bytes (9,958 / 5,600 ≈ 1.8).

When you are sending a fifth of a percent of the original file, I’d call that brilliant :)

Of course, giving the half-a-megabyte of single-character map-data to Gzip, or better yet with the delta encoding too, would doubtless beat my puny RLE encoder; after all, proper compressors use entropy coding and all the bits in a byte and everything else!

I did play with tweaks to the algorithm a bit the day after, such as using unused printable characters to encode very short runs, and increasing the range of symbols used.  And the output was under 8K.  It felt a shame to say that it took two hours, though, so I never submitted it :)

My quickly tweaked and tidied up version gave 8,381 bytes.  0.174487768%.  But I’d slept on it.

UPDATE: After posting this, a couple of people took up the challenge to out-compress me!  Roy got furthest, and I hope he blogs about how!

I made some quick alterations that help compression: it's best to order the map data in column order from bottom-right to top-left, and the years from oldest to newest.  It makes a noticeable difference in the compressibility of this data set.

I then added a post-pass compressor:

10 count occurrences of all characters and digraphs, and of strings up to a certain length (e.g. 10 characters)
20 if all characters are used, what is the cost of freeing up the least common character by converting it to a digraph?
30 and what is the gain of replacing each common substring with a digraph, or a character?
40 perform the greediest replacement
50 go to 10 until no further gains
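In Javascript, the greedy loop looks something like this sketch (hedged: it only handles the case where spare printable characters are still free, skips the convert-least-common-character step from line 20, and its counting ignores overlapping occurrences):

function greedyPack(s, spareChars) {
    var dict = {};                          // replacement char -> substring
    while (spareChars.length) {
        var best = null, bestGain = 0;
        for (var len = 2; len <= 10; len++) {
            var counts = {};
            for (var i = 0; i + len <= s.length; i++) {
                var sub = s.substr(i, len);
                counts[sub] = (counts[sub] || 0) + 1;
            }
            for (var sub in counts) {
                // each occurrence saves len-1 chars; storing the
                // dictionary entry costs roughly len+1
                var gain = counts[sub] * (len - 1) - (len + 1);
                if (gain > bestGain) { bestGain = gain; best = sub; }
            }
        }
        if (!best) break;                   // no replacement gains anything
        var ch = spareChars.pop();
        dict[ch] = best;
        s = s.split(best).join(ch);         // literal find-and-replace
    }
    return { packed: s, dict: dict };
}

Decompression is just the find-and-replaces run in reverse order.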

And the outcome is: 5,475 bytes (0.113986461%)

(Roy’s compression, when post-processed with this simple find-replace compressor, actually slightly improved on this!  But that’s his blog post to make, and it never made it into the map itself ;) )

