This is a note for the book Genome-Scale Algorithm Design: Biological Sequence Analysis in the Era of High-Throughput Sequencing (2nd edition).
Exercise 2.1
Consider the
fail(·)
function of the Morris–Pratt (MP) algorithm. We should devise a linear-time algorithm to compute it on the pattern to conclude the linear-time exact pattern matching algorithm. Show that one can modify the same MP algorithm so that on inputs P = p1p2 · · · pm and T = p2p3 · · · pm#m, where #m denotes a string of m concatenated endmarkers #, the valuesfail(2)
,fail(3)
, . . . ,fail(m)
can be stored on the fly before they need to be accessed.
Exercise 2.2
The Knuth–Morris–Pratt (KMP) algorithm is a variant of the MP algorithm with optimized
fail(·)
function: fail(i) = i’ , where i’ is largest such that p1p2 · · · pi‘ = pi−i’+1pi−i’+2 · · · pi , i’ < i, and pi’+1 != pi+1. This last condition makes the difference from the original definition. Assume you have thefail(·)
function values computed with the original definition. Show how to update these values in linear time to satisfy the KMP optimization.
Java implementation of the Knuth-Morris-Pratt (KMP) string searching algorithm:
package de.yanzhou.genome;
public class KMP {
static int[] buildPatternTable(String pattern) {
int[] table = new int[pattern.length()];
int j = 0;
for (int i = 1; i < pattern.length(); i++) {
while (j > 0 && pattern.charAt(i) != pattern.charAt(j)) {
j = table[j - 1];
}
if (pattern.charAt(i) == pattern.charAt(j)) {
j++;
}
table[i] = j;
}
return table;
}
static void search(String text, String pattern) {
int[] table = buildPatternTable(pattern);
int j = 0;
for (int i = 0; i < text.length(); i++) {
while (j > 0 && text.charAt(i) != pattern.charAt(j)) {
j = table[j - 1];
}
if (text.charAt(i) == pattern.charAt(j)) {
j++;
}
if (j == pattern.length()) {
System.out.println("Pattern found at index " + (i - j + 1));
j = table[j - 1];
}
}
}
public static void main(String[] args) {
String text = "ABABDABACDABABCABABAB";
String pattern = "ABAB";
search(text, pattern);
}
}
Result:
Pattern found at index 0
Pattern found at index 10
Pattern found at index 15
Pattern found at index 17