Wednesday, May 30, 2012

Searching for substrings in a massive string

(Part 2 with stats from the human genome)

How might you search a really big string - a few billion symbols - for the best alignment of a large number of short substrings - perhaps just a few hundred symbols long or less - with some allowed edit-distance fuzzy-matching?

One approach jumped out at me: a sorted string index, like you build before doing a Burrows-Wheeler Transform (BWT)!

For illustration we’ll use a short string of 25 symbols:


The first step of BWT is to compute the sorted rotations of the text.  If you imagine adding special end-of-string symbol on the end of the string, you don’t need to actually compute the proper rotations, simply use a `memcmp`.  So we have an array of 25 offsets into the string, and sorted it looks like this:

This is called a suffix array.  This has O(n) memory and can be created in O(n lg n) or Θ(n) time.

Normally BWT now hops around this index and does the transforming before discarding the suffix array; however, we will not do so.  We will keep hold of this suffix array and use it as the index for our string-searching purposes.

(DNA searching tools have used BWT itself (bowtie) and FM-indices (bowtie2).  This is another approach, using rather more RAM in return for - hopefully - much faster exact matches)

If you now wanted to search for, say, `CATT`you could do so in this index with a binary search.

But some simple meta-data can help us cut that down.  If we count the frequency of each symbol and track the cumulative counts in sorted order:

Look at the cumulative counts - its the offset of into the index for each symbol!  So as `CATT` starts with `C`, and this frequency table tells us that `C` is 5 entries starting at offset 5 in the suffix array, we now only have to search this sub-section of the array.

Imagine we count the frequencies of all pairs of symbols in the string:

Look again at the cumulative counts - its the offset into the suffix array for each digraph.  As `CATT` starts with `CA`, we can easily see that there is just 1 `CA` entry at offset 5.  (Note that the tailing T at position 18 of the suffix array throws off subsequent indices; you’d have a minimum suffix length in your array to match the key length in your lookup table.)

Some digraphs may not occur, of course.  There’s no `AA`s in our string for example.  But in a very large string each digraph is probably represented.  No matter, we could easily allocate a `symbol_count*symbol_count` array for this index into the suffix array.  If we were to search for the substring `AACG`, we’d immediately spot there were 0 `AA`s and we could use ordinal math to know where to look in this index into the index without having to iterate over it.

You can derive the count for a prefix from the cumulative of it and the next prefix alone, so you only actually have to store the cumulative.

You can obviously extend this to longer and longer prefixes.  Imagine you know you have just four unique symbols.  At 2 bits per symbol, 8 symbols would have 64K entries.  At 32 bits for the cumulative, that’s a quarter of a megabyte to know where in array to look for any 8 symbol substring.  Go crazy - with 24-bit prefix you use 64MB to find any 12 symbol substring.

I used 32-bit numbers for the cumulative sums.  What if you have more than 4 billion symbols in your string?  You could use bigger numbers, or you could simply split your string into sections.  The second section could conveniently start max-substring-length from the end of the previous; this overlap would avoid you having to pick over substring matches that bridge your data.  Whether you need to divide your substring into sections or otherwise, you can search it in parallel (using many CPU cores at once) as search is a read-only operation.

Write the index to disk, and re-use it.  Open it with `mmap` and let the kernel cache hotter pages of it between runs.

The nice thing about the suffix array is that the offsets are in sequential memory and CPUs are good at sequential memory reads.  Once you’ve gone and read an element in the suffix array, the CPU has likely got its neighbours in the L1 cache line too.

Dereferencing these offsets into the source string is going to be out to main memory (or even disk) again, so is slightly costly.

Here’s where my mind is taking me:

You are searching for `ATTATGCC` in and your index into the suffix array is two symbols long (as in the example tables above).  `AT` is 4 entries starting at offset 1 in the suffix array.  Rather than going and dereferencing it (we are playing that we have a much longer source string so lots of random memory access is a bad thing), we instead look at the second digraph `TA``TA` is 1 entry at offset 18.  So, we just have to find those entries in the first suffix array sequence that match those in the second, and we’ve doubled our prefix length before dereferencing into the main source string.  I’ve thought along these lines before

I am still reasoning through what the order of the offsets in the table tell us without needing to dereference them too.

You could also track the offset in each string in the index that it differs from the previous; what would that buy you?

Or if you indexed on multi-symbol boundaries; say, if two-bits-per-symbol then, a byte is every forth symbol and is an obvious boundary to consider.

With fuzzy matching, you’d have to consider prefixes with the leading character missing and so on until your threshold is reached.  You’d have to consider prefixes with a classic edit-distance-type measurement (e.g. Smith-Waterman)

Hmm, what I really want to know is how big the human genome is and how the frequencies and max frequency of 24-symbol prefixes in it…

So, work in progress :)

Once I’ve thought this all through, it’ll be time to read the bowtie and bowtie 2 papers…

(Related issue on

 ↓ click the "share" button below!