Monday, November 11, 2019

Slow database? It might not be your fault

<rant>

Okay, it usually is your fault. If you logged the SQL your ORM was generating, saw how you were doing joins in code, or realised what that indexed UUID does to your insert rate, you’d probably admit it was all your fault. And the fault of your tooling, of course.

In my experience, most databases are tiny.  Tiny tiny.  Tables with a few thousand rows.  If your web app is slow, it’s going to be all your fault.  Stop building something webscale with microservices and just get things done right there in your database instead.  Etc.

But, quite often, each company has one or two databases that have at least one or two large tables.  Tables with tens of millions of rows.  I work on databases with billions of rows.  They exist.  And that’s the kind of database where your database server is underserving you.  There could well be a metric ton of actual performance improvements that your database is leaving on the table.  Areas where your database server hasn’t kept up with the regular improvements of the past 20 years in how programs can work with the kernel, for example.

Over the years I’ve read some really promising papers that have sped up databases.  But as far as I can tell, nothing ever happens.  What is going on?

For example, your database might be slow just because it’s making a lot of syscalls.  Back in 2010, experiments with syscall batching improved MySQL performance by 40% (and lots of other regular software by similar or better amounts!).  That was long before the Spectre patches made the cost of syscalls even higher.

So where are our batched syscalls?  I can’t see a downside to them.  Why isn’t Linux offering them, glibc using them, and everyone benefiting from them?  It’ll probably speed up your IDE and browser too.

Of course, your database might be slow just because you are using default settings.  The historic defaults for MySQL were horrid.  Pretty much the first thing any InnoDB user had to do was go and increase the sizes of various buffers and pools, using incantations found by googling.  I haven’t investigated, but I’d guess that a lot of the performance claims I’ve heard about InnoDB on MySQL 8 are probably down to sensible modern defaults.
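By way of illustration, the classic incantation looks something like this (the 8GB figure is purely illustrative, not a recommendation; before MySQL 5.7 you had to set it in my.cnf and restart):

-- check the current InnoDB buffer pool size
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';

-- raise it to something sensible for the box (resizable at runtime since MySQL 5.7)
SET GLOBAL innodb_buffer_pool_size = 8 * 1024 * 1024 * 1024;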

I would hold TokuDB up as being much better on defaults: it took half of your RAM for its cache and deliberately left the other half to the operating system buffer cache.

That mention of the buffer cache brings me to another area where your database could improve.  Historically, databases did ‘direct’ IO with the disks, bypassing the operating system’s page cache.  These days, that is a metric ton of complexity for very questionable benefit.  Take TokuDB again: it used normal buffered reads and writes to the file system and deliberately left the OS half the available RAM so the file system had somewhere to cache those pages.  It didn’t try to reimplement and outsmart the kernel.

This paid off handsomely for TokuDB because they combined it with absolutely great compression.  It completely blows both kinds of InnoDB compression right out of the water.  Well, in my tests, TokuDB completely blows InnoDB right out of the water generally, but then teams who adopted it had to live with its incomplete integration, e.g. minimal support for foreign keys.  Those are things that have nothing to do with the storage layer, and everything to do with how much integration boilerplate was or wasn’t written.  (TokuDB is being end-of-lifed by Percona; don’t use it for a new project 😞)

However, even TokuDB didn’t take the next step: it didn’t go to async IO.  I’ve poked around with async IO, both for networking and the file system, and found it to be a major improvement.  Think how quickly you could walk some tables by asking for pages breadth-first and digging deeper as soon as the OS gets something back, rather than going depth-first and blocking, waiting for the next page to come back before you can proceed.

I’ve gone on enough about TokuDB, which I admit I use extensively.  Tokutek went the patent route (no, it didn’t pay off for them), Google released LevelDB, and Facebook adapted LevelDB into RocksDB, which became the basis of the MySQL MyRocks engine.  That’s all history now.

In the actual storage engines themselves there have been lots of advances.  Fractal Trees came along, then there was an SSTable+LSM renaissance, and just this week I heard about a fascinating paper on B+tree+LSM beating SSTable+LSM.  A user called Jules commented, wondering about B-epsilon trees instead of B+trees, and that got my brain going too.  There are lots of things you can imagine an LSM tree using instead of SSTables at each level.

But how invested is MyRocks in SSTables?  And will MyRocks ever close the performance gap with TokuDB on the kinds of workloads they are both good at?

Of course, what about Postgres?  TimescaleDB is a really interesting extension to Postgres with a ‘hypertable’ approach under the hood: a table made from a collection of smaller, individually compressed tables.  In so many ways it sounds like TokuDB, but with some extra finesse, like storing the min/max values of columns in a segment uncompressed, so the engine can check some constraints and often skip decompressing a segment altogether.

TimescaleDB is interesting because it’s kind of merging the classic OLAP column-store with the classic OLTP row-store.  I want to know whether TimescaleDB’s hypertable compression works for things that aren’t time-series too.  I’m thinking ‘if we claim our invoice line items are time-series data…’
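If I ever run that experiment, I imagine it starts roughly like this (a sketch against TimescaleDB’s hypertable and compression API; the table and column names are invented):

-- pretend invoice line items are time-series, keyed by when they were invoiced
SELECT create_hypertable('invoice_line_items', 'invoiced_at');

-- compress segments, grouping rows belonging to the same invoice together
ALTER TABLE invoice_line_items SET (
    timescaledb.compress,
    timescaledb.compress_segmentby = 'invoice_id'
);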

Compression in Postgres is a sore subject, as are out-of-tree storage engines generally.  Saying the file system should do the compression means nobody has big data in Postgres, because which stable file system supports decent compression?  Postgres really needs built-in compression, and it really needs to embrace the storage-engine approach rather than keeping all the cool new stuff as second-class citizens.

Of course, I fight the query planner all the time.  If, for example, you have a table partitioned by day and your query spans two or more partitions, then you probably get much faster results if you split it into n queries, one per partition, and glue the results together client-side!  There was even a proxy called ShardQuery that did exactly that.  It’s crazy.  When people are writing proxies in PHP to rewrite queries like that, it means the database itself is leaving a massive amount of performance on the table.
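The client-side rewrite, sketched with a made-up events table partitioned by day:

-- one query spanning two day-partitions; the planner often handles this badly
SELECT * FROM events
WHERE created_at >= '2019-11-09 18:00' AND created_at < '2019-11-10 06:00';

-- the rewrite: one query per partition, results concatenated by the caller
SELECT * FROM events
WHERE created_at >= '2019-11-09 18:00' AND created_at < '2019-11-10 00:00';
SELECT * FROM events
WHERE created_at >= '2019-11-10 00:00' AND created_at < '2019-11-10 06:00';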

And of course, the client library you use to access the database can come in for a lot of blame too.  For example, when I profile my queries with lots of parameters, I find that the MySQL JDBC driver generates a metric ton of garbage in its safe-string-split approach to prepared-query interpolation.  It shouldn’t be the case that my insert rate doubles when I switch to a hand-rolled string-concatenation approach.  Oracle, stop generating garbage!

This doesn’t begin to touch on the fancy cloud service you are using to host your DB.  You’ll probably find that your laptop outperforms your average cloud DB server.  Between all the Spectre patches (I really don’t want you to forget about the syscall-batching possibilities!) and having to mess around buying disk space just to get IOPS, and all kinds of nonsense, it’s likely that you really would be better off performance-wise leaving your dev laptop in a cabinet somewhere.

Crikey, what a lot of complaining!  But if you hear about some promising progress in speeding up databases, remember it’s not realistic to hope the databases you use will ever see any kind of benefit from it.  The sad truth is, your database is still stuck in the 90s.  Async IO?  Huh no.  Compression?  Yeah right.  Syscalls?  Okay, that’s a Linux failing, but still!

Right now my hopes are on TimescaleDB.  I want to see how it copes with billions of rows of something that aren’t technically time-series.  That hybrid row and column approach just sounds so enticing.

Oh, and hopefully MyRocks2 might find something even better than SSTable for each tier?

But in the meantime, hopefully someone working on the Linux kernel will rediscover the batched syscalls idea…? ;)

Friday, March 30, 2018

Faster searches with non-prefix fields in composite indices

Here’s a horrid hack or neat trick to work around a gaping inefficiency in the MySQL query planner:

Imagine you have a very large table and a composite index, e.g. over the fields A, B, C and D.

Imagine you have a query WHERE A = ? AND C = ?

MySQL will use the index to check all the values of A, but it will dereference each row matching A before checking the C constraint.

You can see what is happening with EXPLAIN SELECT by looking at the KEY and KEY_LEN columns: the KEY will be the name of the index (good so far!) but the KEY_LEN will only be the max size of column A (ugh!).

If you supply a dummy clause constraining B too, you can force MySQL to check the C constraint before it dereferences any rows: SELECT … WHERE A = ? AND B != ? AND C = ? (where B is compared against some value it can’t possibly be).

You can confirm this is happening by looking at the EXPLAIN SELECT again: the KEY will still be the name of the index (if it isn’t, you can force the index to be used with a USE INDEX clause), and the KEY_LEN will now be the maximum length of the A, B and C fields combined.
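To make that concrete, here’s roughly what the before and after look like on a hypothetical table t with a composite index idx_abcd over (A, B, C, D):

-- only the A prefix of the index is used; KEY_LEN covers A alone,
-- so every row matching A is dereferenced before C is checked
EXPLAIN SELECT * FROM t WHERE A = 5 AND C = 7;

-- add an impossible B constraint so the index itself filters on C too;
-- KEY_LEN should now cover A, B and C combined
EXPLAIN SELECT * FROM t USE INDEX (idx_abcd)
WHERE A = 5 AND B != -1 AND C = 7;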

You can even use this trick to get an index used even if the very first field in the composite index is not constrained, e.g. WHERE A != ? AND B = ?

This isn’t foolproof - you always need to benchmark with your database!  Using an index at all, with its additional dereference cost, only beats a straight table scan if the index is dramatically cheaper to walk and dramatically reduces the number of rows that need to be dereferenced.  If only a tiny subset of the table needs to be dereferenced, the index will help; if a large proportion of the table is going to be dereferenced anyway, then it’d be better to do a table scan from the get-go.

However, if you have very large tables and composite indices that cover your fields but leave some of them unconstrained, then it’s well worth evaluating this approach, which, in the right circumstances, can give orders-of-magnitude improvements.

Why doesn’t MySQL use this trick internally?  Well, it should!  If it has already decided to use a composite ordered index for some prefix of the constrained fields, then it should always check every constraint it can before dereferencing any actual rows.  If MySQL used this trick internally, you wouldn’t have to find some impossible value to fake a check against either.

It’s a bit harder to say whether it could work out when to use a composite index when no prefix is constrained at all.  And of course none of this works with hash indices.

Does MariaDB use this trick internally automatically?  I don’t know but I doubt it.

Does PostgreSQL?  No idea.  Etc.

If you liked this you might like my recent stats on database compression, or browse some of my earlier posts.

Tuesday, March 6, 2018

Why I am excited about RocksDB

TL;DR RocksDB will probably supplant TokuDB as my go-to backend for bigger-than-RAM datasets

A long time ago something amazing happened in database storage engines: the inventors of Fractal Trees spun off a company called Tokutek to productise them.

The resulting TokuDB storage engine ran rings around all other engines.  Now, it did this in part because they smudged the integration a bit; the Tokutek-packaged TokuDB always massively outperformed a generic TokuDB packaged by Percona or MariaDB, because the tweaks weren’t just to the storage layer but also, I believe, to the way MySQL integrated with storage engines.

Keep reading

Saturday, May 13, 2017

Compressing MySQL databases

TL;DR: use TokuDB.  Always.  It massively reduces storage requirements and is massively faster than InnoDB, even on low-end hardware.  My tables were compressed to just 12% of their original size, and that’s fairly typical.

A very simple structure:

CREATE TABLE test (
    date DATE NOT NULL,
    site VARCHAR(8) NOT NULL,
    order_id VARCHAR(64) NOT NULL,
    time TIME NOT NULL,
    kind ENUM(...) NOT NULL,
    account INTEGER,
    user INTEGER,
    client_price BIGINT,
    supplier_id INTEGER,
    supplier_cost BIGINT,
    client_ref VARCHAR(32),
    data JSON,
    PRIMARY KEY (date, site, order_id),
    INDEX (account, user, client_ref)
) DEFAULT CHARSET utf8
PARTITION BY RANGE (TO_DAYS(date)) ( ...
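For reference, the engine clause isn’t shown above; switching a table like this to TokuDB with compression is roughly a one-liner (the TOKUDB_ZLIB row-format name is from memory, so check your build’s documentation):

-- rebuild the table under TokuDB with zlib compression
ALTER TABLE test ENGINE=TokuDB ROW_FORMAT=TOKUDB_ZLIB;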

This system is using the RDBMS as a document store; the order items are flattened into a JSON document (the ‘data’ field).

The ETL is upserting order events as they are processed, using multi-row INSERT INTO … ON DUPLICATE KEY UPDATE …
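Each batch looks roughly like this (a sketch: the column list is trimmed and the values, including the 'sale' kind, are invented):

INSERT INTO test (date, site, order_id, time, kind, client_price, data)
VALUES
    ('2017-05-13', 'uk', 'ORD-0001', '10:01:02', 'sale', 1099, '{"items": 2}'),
    ('2017-05-13', 'uk', 'ORD-0002', '10:01:03', 'sale',  250, '{"items": 1}')
ON DUPLICATE KEY UPDATE
    kind = VALUES(kind),
    client_price = VALUES(client_price),
    data = VALUES(data);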

And I’ve been running performance tests with some real data: 27M rows of real data, all on the same day (so all in the same partition), and 37M upserts inserting or updating those 27M rows, issued as multi-row statements of 1K-2K rows each.  This is enough to get a grasp of upsert performance and to extrapolate storage requirements.

The test box is an AWS t2.medium instance.  This is a very weak VM with 2 (virtual) CPUs and 4GB RAM.  Not normal DB fare, so performance beware.

The 27M rows take 10.48GB uncompressed.  It takes InnoDB 10 hours to load all the data.

Keep reading

Monday, December 12, 2016

Linux on my MBP sucks

I was reading the excellent commentary on LWN and was compelled to record my own experience so far.

I’m one of those being driven away by Apple.  I’ve been a happy Linux user for the past 18 years, but up until now I’ve kept my work MBP an OSX zone, because MBPs have infamously poor Linux support and I just didn’t want to spend time on it.

For work I have an MBP Retina 15-inch, Early 2013, with 16GB RAM and an NVIDIA GeForce GT 650M.  Still quite good hardware, and the thing I most need for my job is more RAM, so upgrading to a newer Mac is not going to help me.

The biggest fight has been with brew and MacPorts, trying to get the ecosystem of tools and servers I need to work well.  It’s Unixish enough to get things working, but not without constant friction.  I increasingly give up and run stuff in VirtualBox instead, but I’ve noticed a staggering performance penalty that is finally becoming unbearable.

So I bit the bullet and set about upgrading to Linux.

I used the OSX disk partition tool to carve off a big chunk of the SSD to dedicate to Linux.  I got the impression that it wasn’t just truncating the partition and about to lose my data, which was reassuring.  I think it moved stuff around to fit the new partition bounds and free up the empty space.

The instructions I found for Arch Linux would be off-putting to dabblers, but I was undaunted and worked through them.  They worked just fine with an Ubuntu net-install stick, as I had a wired network cable to hand.  I don’t think you can get a wireless install to work.

The dual-boot aspect works well; by default it boots into Xubuntu (my desktop of choice), but if I hold down the ALT key when I hear the startup chime it gives me a boot menu that lets me boot into OSX.

The default video drivers don’t offer hardware acceleration.  Being Xubuntu, you can enable the non-free NVIDIA drivers to get hardware acceleration.  However, I found this drank battery like a Hummer, so I had to turn it off again immediately.

I’ve noticed the WiFi signal strength is much poorer than in OSX, and I have to move my laptop around to get a good enough signal at a distance.

There are three fundamental problems, however:

1) a Linux desktop on a retina screen is like a postage stamp!  I played around trying to set scaling and everything, but most programs do not honour those settings.  It’s unusable.

2) the touchpad support sucks.  Really really sucks.  It’s unusable too.  I tried changing settings and couldn’t get anything I was happy with.

3) a long time ago, when I upgraded the major version of OSX, it offered to encrypt my file system and I chose to do it.  At the time I couldn’t think of any reason not to, but now that I dual boot I can’t access my OSX partition from Linux.  I think I once found a FUSE driver that promised to support Apple Core Storage encryption, but I haven’t been able to find it again to try it out.

Sadly, I now routinely hold down the ALT key as my MBP boots.

Monday, June 13, 2016

My attack on the Enigma cipher machine


My favourite book was Fermat’s Last Theorem by Simon Singh.  So when, in 1999, I saw his new book - The Code Book - in shops, I just had to buy it, despite it being incredibly expensive (for a very poor student like myself).  I then took my new prize possession with me expectantly on the summer holiday.  And it was only when I reached the end of the book that I discovered it contained a Cipher Challenge!

I set about cracking the ciphers with gusto.  By chance I happened to have my 286 computer with me, and my hobby language of choice - Turbo Pascal 7. 

I think I skipped the ADFGVX cipher, so excited was I at the prospect of cracking the Enigma at stage 8 of the contest.  Unfortunately, when I reached the Enigma, I was stumped.  I couldn’t make any headway with simulating the Enigma.  I wrongly assumed that the message would begin with the message key encrypted twice with the day key.  But my biggest mistake was not understanding how the rings affected things.  I didn’t understand how the rings affected things because The Code Book didn’t actually explain them!  The book instructed the intrepid explorer to search the Internet for more details, and I didn’t take any heed of this because a) I couldn’t imagine the book wouldn’t actually give me everything I needed to crack the challenge, and b) I didn’t have any Internet.  The Enigma I was simulating wasn’t even an accurate Enigma.

So, 17 years later, I find myself on a whim tackling the Enigma once again.  And my computer is a bit faster too.  But I want to attack it my way, putting myself as much as possible in the shoes of my old self 17 years ago, only this time with the Internet, so I can double-check the correctness and completeness of my implementation of the Enigma cipher machine.

This is my attack on the Enigma.  I will describe my attack on an M3 Enigma (used by the army and air force) and then I will generalize it to attack the M4 (used by the U-boats; what Alan Turing is famous for cracking).

Keep reading

Saturday, June 11, 2016

A brief history of the Enigma and the pre-war cracking of it

A lot of people now think that Alan Turing cracked the Enigma.  A lot of people like to correct those people and say that it was the Poles who truly cracked the Enigma.  So we really need to correct everyone and set things straight, because it was far more nuanced than that, with far more people involved, and different versions of Enigma were cracked even earlier.  If you read enough about code breaking you end up with a fuzzy timeline in your head pieced together from all the various sources.  Here’s what I’ve osmosed:

[image: timeline]

Keep reading

Tuesday, June 7, 2016

Fast Enigma sim in C++

Following on from my Python Enigma cipher machine simulator, I re-wrote it in C++ for maximum speed.

The settings for each rotor - its start position and ring setting - can actually be combined into a single offset. This is perhaps why all the popular treatments of the subject just ignore the impact of the rings on the number of combinations. In truth, their impact is hard to explain. The rings change when the rotors step over, and they do have an impact on the number of combinations, but many initial starting positions are equivalent for short messages, because the ring setting cancels out the rotor start position and the message is too short to cause a turnover.

The M4 Enigma - as infamously used by the U-boats - is usually described as a ‘four rotor’ machine. And that’s strictly true, I suppose, but I think I have a better way of describing it:

The M4 was a normal 3-rotor M3 Enigma chassis. A new ‘thin’ reflector design was developed that could take a new, thinner ‘Greek’ non-rotating wheel. The thin Greek wheel could be placed in any of 26 positions and clicked into the thin reflector. This could then be placed in the chassis in the normal reflector position.

So basically the M4 is an M3 Enigma with a configurable reflector.

I wrote the fast Enigma sim to serve as a basis of a statistical attack on the Enigma. In fact, if you look at my test vectors, you’ll notice one of them is Simon Singh’s Cipher Challenge Enigma stage 8, which I cracked using my statistical attack! More on that in the next post :)

And here’s the code. As you can see, the vast majority of the code is just test vectors to ensure its correctness:

Keep reading

Thursday, June 2, 2016

Enigma Machine on Paper and Python

(don’t forget to also read: my attack on the Enigma cipher machine)

It turns out my kids have been sending each other secret messages, enciphered with a substitution cipher of their own invention!  They only let me see the secret key when I agreed to help them mix up a very complicated recipe for invisible ink:

[image]

This reawakened fond memories of a young me getting a good way through Simon Singh’s The Code Book cipher challenge :)  (Also, see the Enigma Spreadsheet my friend made a few years ago!)

So my mind raced, considering how fast today’s laptops could brute-force, say, Enigma.  Even non-brute-force attacks are on a different scale if you have a computer.  For example, with Enigma, can you ignore the plugboard and go through every combination of day and message key, using a histogram to spot possible plaintext that you then attack as a substitution cipher?

First I needed an accurate Enigma simulator.  To be honest, I found most descriptions a bit light on detail.  But I quickly found a paper Enigma, which I made:

[image: the paper Enigma]

From this, I was able to create a Python Enigma machine!  Source available here:

Keep reading

Thursday, May 26, 2016

Dividing and conquering summation

Here’s a very simple function:

char foo(char a, char b, char c, char d) {
    return a + b + c + d;
}

What do you think the optimal approach for adding four integers is?

One approach is to keep a running total and add each term in turn:

foo:
    addb    %dil, %sil
    addb    %dl, %sil
    addb    %cl, %sil
    movsbl  %sil, %eax
    retq

And that is exactly what Clang+LLVM 3.8 does!

When I encountered this I did a second take. The obvious way to get better instruction-level parallelism is to add a+b and add c+d, and then add the two partial sums. The CPU will be able to perform these adds in parallel because they do not depend on each other.

At this point I became very interested to know what other compilers do. I found an excellent online tool for showing what various compilers generate: godbolt, the Compiler Explorer! https://gcc.godbolt.org/#

Here is what GCC does:

foo:
    leal    (%rcx,%rdx), %eax
    addl    %esi, %eax
    addl    %edi, %eax
    ret

This is dividing and conquering the summation. Here’s what Intel’s ICC compiler does too:

foo:
    addl      %esi, %edi
    addl      %ecx, %edx
    addl      %edx, %edi
    movl      %edi, %eax
    ret

So how could I have possibly found such a blatant missed optimisation in LLVM??!!? I have asked my friendly x86 mega-expert, and it will be interesting to hear what he thinks of the code the various compilers generate!

Tuesday, May 10, 2016

Duck typing

Old but great:

[image]
Monday, April 25, 2016

Ludum Dare #35 Mosaic

It’s becoming a Ludum Dare tradition that I make a mosaic for each competition. I didn’t enter this time, but I still made a nice mosaic:

[image: the mosaic]
Saturday, February 20, 2016

Securing the iPhone

The FBI want a backdoor into the iPhone, and they are trying to publicly force and shame Apple into giving them one.

The phone they want Apple to decrypt was used by a San Bernardino terrorist.  Surely everyone is rooting for the FBI on this one?!  As long as the news focuses on that part, public sentiment will be firmly on the FBI’s side… bad bad Apple!

Of course, it’s just an excuse.  It’s a very transparent excuse, but only if you look past the shock-and-awe headlines:

The San Bernardino terrorist had two phones.  He carefully and completely destroyed his personal phone before the attacks.  The remaining phone is his work phone, owned by the government (he was their employee).  The FBI say they want to see correspondence the attacker had with his coworkers.  But they can already get that information from the coworkers’ phones etc.  Just a cursory glance dispels the importance of getting into the one phone the attacker didn’t deem worth destroying.

The White House went so far as to say that the FBI specifically didn’t want a ‘backdoor’, they just wanted access to this one iPhone.

They must be using their own definition of backdoor, then!

The key problem here is twofold:

* that as soon as the precedent is set, law enforcement will be able to get all phones unlocked, and for increasingly minor offences.  Within a year or two, councils in England will be insisting phones are unlocked when investigating school catchment area cheating.

* that it will inevitably leak.  The Chinese have used lax law-enforcement security to breach Google before, and they’ll do it again.  They will get hold of it.  And so will the Russians and the North Koreans, and even Uzbekistan in its international war on dissidents.  The Chinese will get similarly privileged access to pretty much all phones anyway, as they actually manufacture them, and the Russians and every other bad-guy nation state will get into everything too.  It’s just a given.

Eventually, muggers will be unlocking phones before resale to scrape them for passwords and pin-codes and see if there’s any electronic wallet swipe payment thing they can empty.

Both these aspects scare the life out of me.  But it’s the second point that is being discussed least, so let’s extrapolate how that affects our business interests all over the world:

When I’ve travelled on business I’ve taken care to use encryption.  I’ve never been stopped at a border, but many people have.  And when you are stopped, you sit in a little interview room for an hour or two (if you’re lucky) and then you are allowed on your way.  And you know that they are busy photocopying every piece of paper you have and imaging your hard-disks.

The US is one of the worst offenders here, but you can expect the same treatment in whatever country has business interests in whatever it is you are working on.  US businesspeople travelling to China should take precautions.

It is not about terrorism and drug smuggling.  It’s about blatant commercial espionage, blessed by governments.

It’s not enough now for Apple to refuse to do it.  The only way forward is for Apple to ensure it cannot do it on future Apple products.  And everyone else must lock down their phones too.

All phones need an Apple-style secure enclave.  And we need it to be impossible to flash a phone whilst it is locked.

This will protect phones that are turned off.  As you cross borders, turn your phone off.

Additionally, good phones need a duress pin-code.  By having hidden volumes with TrueCrypt-inspired plausible deniability, we can give some measure of protection against the $5 wrench attack as you transit through hostile countries…

Thursday, January 28, 2016

New Security through Technology

I was just reading this horrible story of digital harassment, about a family whose life is being ruined by the internet trolls their ‘hacker’ son fell out with.  It’s not the first such story, and it won’t be the last.  I think this horror will grow rather than abate as more and more kids discover the rush of invincibility that comes from hurting people remotely.  The youth of the world is getting online…

I’m too grown up to understand lulz.  Seems so sad.  Pity the perpetrators.  But due to jurisdiction and anonymity, I would be surprised if they can be brought to task for their crimes.

Now, on a technical note, the perpetrators often seem to exercise terrible tradecraft.  They often end up getting tracked down, which surprises me.

Bruce says “we need to figure out how to identify perpetrators like this without destroying Internet privacy in the process”.  Here’s how I think we can do that:

Keep reading

Monday, January 4, 2016

Where are all the Raspberry Pi robots?

My daughters are really into robots.  They love going to the science camps during school holidays, and the highlight is always programming the Lego Mindstorms robots.

So I got all excited about getting them a robot for Xmas, and went looking.  My friend Roy recently blogged about robots (hacking a Roomba! and Makeblock) and this really whetted my appetite :)

In the end I got them a SparkFun Redbot inventor kit.  It’s an Arduino with line followers, whiskers, an accelerometer for bump detection, and magnetic wheel-rotation counters.

[image]

But it’s really bugging me how limited the Arduino is!

Here’s the Raspberry Pi Zero that I read about recently:

Keep reading