Indexing the genome - index assisted pattern matching, sorted list & hash table, subsequences vs substrings

6 minute read

Published:

The Boyer-Moore matching algorithm uses the bad character and the good suffix rules to skip futile alignments and speed up the search. It preprocesses the pattern P to build up look up tables for the two rules.

Another way to speed up the search is to preprocess the text T. Just like indices at the end of a book, the purpose of indexing/preprocessing T is to help us quickly locate the content of interest (pattern P) in T.

Matching/searching algorithms where T is preprocessed is called an offline search, otherwise it’s called an online search.

Note that this definition does not care if P is preprocessed or not, as algorithms that preprocesses P usually do it on the fly, whereas the preprocessing of T is usually done before hand and the indices stored somewhere for a quick look up or search. For example, the Boyer-Moore algorithm is an online searching algorithm, as it does not preprocess the text T.

In our read alignment problem, many sequencing reads need to be aligned to the same reference genome, so even it might take a long time to index the genome, the cost vanishes over billions of alignments.

Indexing approaches: sorted lists, hash tables

To index the reference genome, which is a string of text T we want to align the pattern P to, we need to pick a length of substring and chop T up to get all possible substrings in T, each substring is called a key, we need to store the keys and their locations/offsets for query the index.

If length of each key is k, it’s called a k-mer index. Here’s an example of building a 5-mer index for T, which is to get all possible substrings of T with length 5.

   0123456    
T: CGTGCGTGCTT
   CGTGCGTGCTT 
   CGTGCGTGCTT
   CGTGCGTGCTT
   CGTGCGTGCTT
   CGTGCGTGCTT
   CGTGCGTGCTT
   CGTGCGTGCTT

Index of T
CGTGC: 0, 4
GTGCG: 1
TGCGT: 2
GCGTG: 3
GTGCT: 5
TGCTT: 6

Two common structures for building indices are sorted lists and hash tables.

Sorted list

  • Sort keys alphabetically for quick searches/queries.
  • A sorted list allows us to perform a binary search and find the key in Log2(N) time.
  • The sorted list can be implemented in an array or a linked list, with each element of the list consisting of a key and a pointer to the corresponding data item.
Sort keys alphabetically
CGTGC: 0, 4
GCGTG: 3
GTGCG: 1
GTGCT: 5
TGCGT: 2
TGCTT: 6

Hash tables

  • The keys can also be stored as a hash table/ hash map/ dictionary.
  • The data structure allows constant O(1) look up time.

Index assisted exact search: query & verification

Once a k-mer index is built for the text T, for each pattern P, a match can be found by 2 steps.

  1. Take the first k characters of P (P[:k]), it has to match to one of the keys in the index. The matches/hits can be retrieved by querying the index.
  2. For each hit, verify if that the rest of P (P[k:]) is also matched with T. Say a hit has offset h, we need to compare whether P[k:] matches T[h+k: h+k+len(p)]

Note: it doesn’t have to be the first kmer in P for the query, the second, third, or any kmer in P would work the same, but for each hit, we still need to check whether the remaining part (all characters outside of that kmer we query) matches T.

Not all k-mers required: every nth kmer

When we record all k-mers (substrings) in the index, we can query using any (the 1st, 2nd, 3rd, etc.) k-mer of P, which indicates there may be some redundancy to the k-mer library - if we can utilize multiple kmers from P.

Every other k-mer

What if instead of every k-mer, we only record every other k-mer in T?

It means the k-mers are taken form either even offsets (0, 2, 4, 6, …) or odd offsets (1, 3, 5, 7, …). Let’s take the even offsets as an example.

  • When recording all k-mers of T (t_kmer), for each P, we can take any k-mer of P (p_kmer) and query and verify if it’s a match.
  • If the index only has k-mers with even offset, for a p_kmer, we lose the hits where p_kmer matches a t_kmer with an odd offset.
  • However, for each case where p_kmer matches a t_kmer with odd offset and eventually leads to a full match, the kmer right next to the p_kmer in P (say p_kmer’) would match to the kmer right next to t_kmer - which would be a t_kmer with an even offset.
  • So if we query both p_kmer and p_kmer’ (the kmer right next to it) against t_kmers with even offset, we can cover the matches yielded from p_kmer matching an odd offset t_kmer.

Note that p_kmer’ doesn’t need to be right next to p_kmer, just one of them needs to have a even offset (in P), and another to have an odd offset.

Every nth k-mer

In the generalized case of recording every nth k-mer in T, we need to check n different kmers in P, the offset of those p_kmers need to cover every module of n, i.e., there needs to be a p_kmer with offset % n equals to 0, another with offset % n equals to 1, etc., all the way to offset % n equals n - 1.

For example, we can take every 3rd t_3mer, and for each P, we need to query 3 p_3mers with offset mod 3 equals to 0, 1, 2, respectively, there we chose 3mers starting at offset 3, 1 and 5, respectively.

Every 3rd k-mer
       T: ∎∎∎∎∎∎∎∎∎∎∎∎∎∎
   Index: ∎∎∎   ∎∎∎   
every 3rd    ∎∎∎   ∎∎∎

       P: ∎∎∎∎∎∎∎∎
              ∎∎∎ 0 mod 3
   Query:  ∎∎∎ 1 mod 3
                ∎∎∎ 2 mod 3

Taking only every nth k-mer will make the index smaller and faster to query.

Subsequences vs substrings

So far we’ve taken strings - continuous chunks of k-mers.

More generally the index can be sliced with any pattern, like the following pattern of extracting characters at offset [0-1, 3, 5-6].

   012345678    
T: CGTGCGTGCTT
   CGTGCGTGCTT 
   CGTGCGTGCTT
   CGTGCGTGCTT
   ...
Sorted index of T
CGGGT: 0
CGGTT: 4
GCTCT: 3
GTCTG: 1
TGGGC: 2

T: CGTGCGTGCTT
P: GCGTACT

In this case we need to slice out the same pattern in P for query, like shown above.

Similarly, we can skip to take every nth pattern in T, and check every offset with every mod of n in P

Using subsequences instead of substrings allows us to match different locations in P at the same time, which leads to a higher successful rate during the verification steps.