Monday, September 2, 2013

Running SQL in your browser

As a key part of my Ludum Dare entry The NSA’s Where’s Snowden? game I wanted you - the elite NSA Analyst - to be able to hack the game world.  Imagine you could put your agents onto flights, or fiddle with the flight manifests so that targets were refused boarding or even arrested… and how would you do this?  I decided I wanted you to have SQL access to the world’s flight plan database…


But I couldn’t find any working SQL engines for Javascript!  I didn’t find the emscripten SQLite. The only thing I could find was TrimQuery, but I couldn’t get it to work and its quite old and not actively developed.  There’s are also a few LINQ-likes but they are not really SQL.

Now SQL is the lingua franca of structured data querying so I was surprised at this sorry state of affairs.  Whilst LINQ is nice (because of the VS tooling!) I feel that a lot of the LINQ-likes are driven by hipster cargo-culting and that a solid SQL that can query conventional JS objects and arrays would be a powerful and useful thing.

So I set out to build an SQL engine in Javascript (sql.js).  It was perhaps more fun to develop than the other parts of the game and more fun to examine than the game is to play :)  It amazes me what I got working in ~4 hours and under 500 lines of code.  This blog post describes how it works (and how it can be done better):

How to use sql.js

It consists of a single function SQL().  The parameters are:

the SQL statement,  e.g. “INSERT INTO …”

the definition of the tables, which is a Javascript object acting as a dictionary.  Each table is itself an object, detailing the column names and specifying any constraints on them (e.g. if they must be unique or can be null) and optionally a description.  E.g.
{ airports: { code:{description:”IATA code”,unique:true,notNull:true,}, name:{notNull:true,}, country:{notNull:true,}, category:{description:”level of CIA infiltration”,}, lat:{description:”latitude (decimal degrees)”,}, lng:{description:”longitude (decimal degrees)”,}, }, airlines: { code:{unique:true,notNull:true}, name:{}, country:{}, },…

table access; an object where each table name that the user can update has a boolean e.g.
( airports: false, flights: true,…
Any table name that is omitted is read-only.

The data; an object where each table is a key and each table is an array of objects e.g.
airports: [
{code:”LHR”,name:”London Heathrow”,…
manifests: [

params.  This object allows you to override things like setting the default row count LIMIT.

If the SQL cannot be parsed or if the table data does not have the expected fields then it throws an Error() exception.

Otherwise it returns an object with a columns key and a rows key.  The columns are an array of strings which are the column titles.  Rows are an array of arrays.

Parsing SQL

To parse a language you often use a parser generator like lex/flex yacc/bison.  There are parser generators for Javascript too - jison and peg and so on - but I haven’t used them and, to be honest, its quicker to just write our own parser from scratch.

To parse SQL we can eat the statement a token at a time.  A token might be a number, an identifier or a symbol like an opening brace.

Normally when I hand-roll a parser - its a bit of a recurring hobby of mine - I write the tokeniser to see what the next character is, and then based on whether that is a number, a quote mark or the beginning of an identifier or so on loop through until the token closes.  In sql.js however, I took a messy shortcut.

When asking for the next token, I pass into the tokeniser the list of terminating strings e.g. when consuming table names in a FROM clause I’d ask for the next token (which I expect to be a table name) which can be followed by a comma, the word AS or the word WHERE or a semi-colon.  If my engine supported it, I could just add those additional possible keywords that follow a table name in the FROM clause e.g. ORDER, LIMIT and HAVING and JOIN and so forth.

This meant that the next() function only had to do a Javascript string search - indexOf() - to find the nearest terminator.  However, this approach does have limitations e.g. it doesn’t properly parse strings (if the terminator occurs in the string the indexOf() will incorrectly find it) and words that contain keywords (the Airport table namehas OR in it, for example).  By the time you’ve added code to address this, you end up with a proper tokeniser anyway.  So in hindsight I should have doubled my line-count and just gone with a classic tokeniser in the first place.

Executing SQL

Once the query is parsed and the array of table names built and, its just to execute the query.  And this is surprisingly simple using the Cartesian Product.  If table A had two rowsand table B had two rows then the Cartesian Product would be:

A.V | B.V
1 | 1
1 | 2
2 | 1
2 | 2

No surprises there.  But what if we add a WHERE clause?

A.V | B.V
2 | 1
2 | 2

We can get that by first generating the Cartesian Product - simple by iteration and recursion - and then filtering the rows by evaluating the WHERE clause against each line!

This is thoroughly inefficient.  If there are 100 rows in one table and 100 rows in another then the Cartesian Product is (100+100)^2 = 40,000 rows!  It would be much more efficient to evaluate - and cull - candidate rows as soon as possible e.g. before recursing into every B row first check if A.V is 2.

In the NSA game it’s easy to be evaluating a Cartesian Product that is hundreds of thousands long, and this can take several seconds.

The engine supports SELECT, INSERT, UPDATE and DELETE.  It doesn’t support the JOIN keyword but it does let you join in the WHERE clause (as described above).  It doesn’t support IN.  It additionally has SHOW TABLES, DESCRIBE table and CHECK table which provide utility.  Because of the needs of the game there is also some rudimentary access control so the ‘hacker’ can’t really hack - and break - the game itself e.g. rescheduling or deleting flights and other havoc.

I made the whole thing case-insensitive but that could easily be removed for data values and comparisons themselves.

I added support for literals but the only literal I added was ‘NOW’.  I treated all dates as strings and this allowed simple Javascript string comparison of dates.  I didn’t support any functions, but this would be very easy to add too.

Improving performance

A proper SQL query planner (the code that decides the order that table rows are evaluated) has to understand the cost of retrieving rows (from disk and so on), but in Javascript RAM access speed is fairly equal and without indices you need full table spans so its just to filter tables by the parts of the WHERE clause that reference them as greedily as possible.

Each table is an array of objects.  It would be straightforward to support arrays of arrays, and objects of arrays and objects of objects.  Using objects as the collections themselves would effectively be a primary key - an index!  Knowledge of this could dramatically speed up joins.

The engine turns the incoming SQL into a chain of closures.  TrimQuery, by comparison, turns the SQL statement into a Javascript fragment that is then eval()ed.  I am not sure that is faster; doesn’t that cause the Javascript VM itself to abandon a lot of cached JITed code?  But it may be faster to instead compile the SQL down to a simple VM.

If you are running the same query again and again it would certainly help to ‘compile’ or ‘prepare’ the SQL statement once and have a way to run it again and again against data

sql.js is surprising short given what it can do, and surprisingly simple.  I found it easier to write SQL statements myself to work out what was happening in the model than to use array.filter() once joins were involved.  With a better tokeniser and an expanded grammar and a simple query planner it could be widely useful.

sql.js is doubtlessly more interesting than actually playing the game ;)


  1. williamedwardscoder posted this

 ↓ click the "share" button below!