Naive exact match is a simple algorithm to solve alignment problems in genomics.
The alignment problem
To find out the location of where a pattern p occurs in text t.
For example, the pattern “orl” has an exact match in text “Hello world.” at position 7.
P: orl T: Hello world. 012345678901
In genomics sequencing, the pattern would be a single DNA read from the sequencer, which can be ~150 bases long, and we need to find out where the read lies on the entire human genome, which is 3 billion bases long.
Naive exact match
As the name indicates, the naive exact match algorithm is a simple/naive/straight forward way to find exact matches of pattern p in text t.
It does so by going through each possible locations in text t, one by one, and check if the substring matches the pattern p.
P: orl T: Hello world. 0 orl 1 orl 2 orl 3 orl 4 orl 5 orl 6 orl 7 orl # Hello world. 8 orl 9 orl
Imagine a slider with the same length of string p, and it slides from the beginning to the end of the string t.
Note that the last substring of length P starts at position len(T) - len(P) + 1, after that position, a substring of len(P) would exceed the end of text T.
Implementation in python.
Coding practical: naive exact match
The coding project comes from the Algorithms for DNA Sequencing course from Johns Hopkins University.
My codes are available in this Google Colab file. Feel free to play around with it.
In this example, we use the naive exact match algorithm to find alignment of DNA substrings to a reference genome, and explore the cases where mismatches/substitutions (mutations) are allowed.
Alignments of DNA substring to ref genome
Find patterns that matches the given “lambda virus” genome sequence https://d28rh4a8wq0iu5.cloudfront.net/ads1/data/lambda_virus.fa
Q1. How many times does AGGT or its reverse complement ACCT occur in the lambda virus genome? E.g. if AGGT occurs 10 times and ACCT occurs 12 times, you should report 22.
Q2. How many times does TTAA or its reverse complement occur in the lambda virus genome? Hint: TTAA and its reverse complement are equal, so remember not to double count.
Q3. What is the offset of the leftmost occurrence of ACTAAGT or its reverse complement in the Lambda virus genome? E.g. if the leftmost occurrence of ACTAAGT is at offset 40 (0-based) and the leftmost occurrence of its reverse complement ACTTAGT is at offset 29, then report 29.
Q4. What is the offset of the leftmost occurrence of AGTCGA or its reverse complement in the Lambda virus genome?
Solution
We need to read the FastA file and store the entire genome as a string.
We need a naive exact match function that returns matching location of the patern p, given the sequence s.
To consider reverse complements, we need a function that turns a sequence string to its reverse complements, then get the reverse complements of both p and s. If s does not equal to its reverse complement, we’ll check the reverse complement matching and append the results.
Allowing mismatches
Q5. Make a new version of the naive function called naive_2mm that allows up to 2 mismatches per occurrence. Unlike for the previous questions, do not consider the reverse complement here. How many times does TTCAAGCC occur in the Lambda virus genome when allowing up to 2 mismatches?
Q6. What is the offset of the leftmost occurrence of AGGAGGTT in the Lambda virus genome when allowing up to 2 mismatches?
Solution
To allow mismatches, we’ll use a counter variable to record number of mismatches. When mismatches happen, instead of immediately break out of the comparing loop, we check if the number of mismatches has exceeded the threshold, and only break when it does (in this case >2).